ABSTRACT

Web search logs contain extremely sensitive data, as evidenced by 
the recent AOL incident. However, storing and analyzing search 
logs can be very useful for many purposes (e.g., investigating human 
behavior). Thus, an important research question is how to privately 
sanitize search logs. Several search log anonymization techniques 
have been proposed with concrete privacy models. However, in 
all of these solutions, the output utility of the techniques is only 
evaluated rather than being maximized in any fashion. Indeed, for 
effective search log anonymization, it is desirable to derive the op- 
timal (maximum utility) output while meeting the privacy standard. 
In this paper, we propose utility-maximizing sanitization based on 
the rigorous privacy standard of differential privacy, in the context 
of search logs. Specifically, we utilize optimization models to max- 
imize the output utility of the sanitization for different applications, 
while ensuring that the production process satisfies differential pri- 
vacy. An added benefit is that our novel randomization strategy 
ensures that the schema of the output is identical to that of the in- 
put. A comprehensive evaluation on real search logs validates the 
approach and demonstrates its robustness and scalability. 

Keywords: Search Logs, Differential Privacy, Optimization 

1. INTRODUCTION 

Search engines are used by millions, if not billions, of people 
every day. The queries posed by users form a large volume of 
data that can give great insight into human behavior via their search 
intent. Indeed, such data is invaluable for researchers and data 
analysts in numerous fields [11]. For example, search engines 
themselves can use web search logs to identify common spelling errors, 
to recommend similar queries, or to expand queries. Many other 
applications also make use of search log data, such as the analysis 
of living habits from daily searches and the detection of epidemics 
[9]. For this reason, search log data is collected, stored, and 
analyzed in different ways by all search engines. 

However, one problem with the storage and release of search log 
data is the potential for privacy breach. The queries that a user 
poses may sometimes reveal their most private interests and 
concerns. Thus, if search log data is published without sanitization or 
with trivial anonymization (such as simply replacing user-IDs by 
pseudonyms), many sensitive queries and clicks can be explicitly 
acquired by adversaries. As [3, 11] demonstrate, it can take only 
a couple of hours to breach a particular user's privacy in the absence 
of good anonymization. Thus, it is crucial to anonymize search log 
data appropriately before storing or releasing it. 

There has been significant work on database anonymization that 
looks at how to anonymize relational data. However, much of this 
work is not directly applicable, since there are significant 
differences between search logs and relational data. Indeed, search logs 
pose additional challenges for anonymization. First, there is no 
explicit distinction between quasi-identifiers and sensitive 
information in search logs. Each user may pose hundreds of queries 
that involve a great deal of personal information (e.g., name, addresses, 
living habits, etc.) over a short period of time. By combining these 
queries, adversaries may easily discover an individual's identity, 
and it is difficult to foresee all possible combinations that can lead 
to privacy breaches. For instance, Table 1 illustrates a subset of an 
Internet user Alice's search log (note: the user-IDs can be 
determined by cookies, IP addresses or user accounts; we ignore query 
time and item rank of search logs in this paper). Although the real 
user-ID has been replaced by the pseudonymous ID 000101, 
adversaries can still identify Alice's search log if they have some 
background knowledge about Alice (e.g., her address, that she recently 
bought a second-hand Honda car via autotrader, that she likes pizza), 
and thus learn more sensitive information (e.g., pregnancy test) from 
Alice's complete search log. Second, search logs are sparse and 
high-dimensional, so it is more difficult to guarantee rigorous 
privacy without sacrificing too much utility. 



Table 1: An Example of Search Logs 

User-ID   Query                 URL                   Count
000101    1 Washington Avenue   maps.google.com       5
          Honda                 www.honda.com         2
          autotrader            www.autotrader.com    4
          pizza                 www.pizzahut.com      1
          pregnancy test        www.medicinenet.com   1



In recent years, several search log anonymization techniques have 
been proposed in the literature to resolve the above problems 
[20, 5, 17, 14, 15, 19, 23]. Several anonymity models have been 
proposed for this domain along with corresponding anonymization 
algorithms. However, their basic premise is simply that the algorithm 
must satisfy the privacy requirements, without considering the 
tradeoff between privacy and utility. Ideally, what is needed is a 
strategy that can maximize the utility while satisfying a given 
privacy requirement. To our knowledge, there is little work focusing 
on this challenging and practical problem. In this paper, we take the 
first step towards tackling this problem in the domain of search log 
anonymization by formulating utility-maximizing problems while 
ensuring a rigorous privacy standard. 

1.1 Contribution 

Given a particular privacy requirement, the utility-maximizing 
problem requires finding a way to anonymize search logs in a 
manner that satisfies the privacy standard and simultaneously achieves 
the optimal output utility. This requires deciding on a suitable 
privacy requirement as well as an appropriate data utility measure. While 
several different anonymity models have been proposed in the 
literature, in this paper we utilize the robust privacy definition of 
differential privacy [7] (which lowers the privacy breach risk even 
if the adversaries hold arbitrary prior knowledge). We also define 
several different notions of utility and propose differentially private 
sanitization methods that can maximize the output utility. Thus, the 
main contributions of this paper are summarized as follows: 

• The differentially private randomization in prior work 
(Korolova et al. [19] and Gotz et al. [10]) ensures differential 
privacy by adding Laplacian noise to the aggregated query and 
clicked url counts. However, such approaches break the 
association between distinct query-url pairs in the output, since 
all the user-IDs have been removed; the output is therefore useful 
in only a few applications. We instead propose differentially 
private algorithms based on a different randomization 
strategy: sampling user-IDs for every click-through query-url 
pair from a multinomial distribution, which preserves 
user-IDs. This, to our knowledge, is the first randomization 
strategy to generate output with the same schema as the input 
search log. Thus, the sanitized search log can be analyzed 
in exactly the same fashion and for the same purpose as the 
input. 

• Within our approach, the randomization algorithm also 
produces utility-maximized output while remaining differentially 
private. To do this, we formally define the utility-maximizing 
problem: find an optimal sanitization that maximizes the 
output utility while satisfying differential privacy. Specifically, 
for quantifying the output utility, we define three different 
utility notions (measuring the utility of frequent click-through 
query-url pairs, the query-url pair diversity, etc.) that could 
benefit different applications (essentially, any utility measure 
can be coupled into our differentially private sanitization by 
replacing the utility objective function). We also prove that 
our sanitization satisfies differential privacy. 

• We transform the utility-maximizing problems into standard 
optimization problems, which allows us to leverage existing 
effective solvers and adapt them to our problem. We 
experimentally validate the utility using real data sets. 

The remainder of this paper is organized as follows. Section 2 
reviews the related literature. In Section 3, we present our 
privacy model and the sanitization process. Section 4 introduces the 
constraints that guarantee privacy protection. In Section 5, we 
formulate three different utility-maximizing problems and show that the 
corresponding sanitization methods are differentially private. 
Section 6 evaluates the output utility of the proposed 
sanitization approaches. Finally, Section 7 concludes the paper. 



2. RELATED WORK 

2.1 Search Log Anonymization 

Following the AOL search log incident, there has been some 
work on user privacy issues related to privately publishing search 
logs. Adar [1] proposes a secret sharing scheme where a query 
must appear at least t times before it can be decoded. It may 
potentially remove too many harmless queries, thus reducing data utility. 
Kumar et al. [20] propose an approach that tokenizes each query 
tuple and hashes the corresponding search log identifiers. However, 
inversion cannot be done using just the token frequencies. Also, 
serious leaks are possible even when the order of tokens is hidden. 

More recently, some anonymization models [19, 14, 15, 23] have 
been developed for search log release. He et al. [14], Hong et 
al. [15] and Liu et al. [23] anonymized search logs based on 
k-anonymity, which is not as rigorous as differential privacy [10]. 
Korolova et al. [19] first applied the rigorous privacy notion of 
differential privacy to search log release by adding Laplacian noise. 
However, this work has several shortcomings. First, 
its released result is the statistical information of queries 
and clicks, where all users' search queries and clicks are aggregated 
together (without individual attribution). The data utility might be 
greatly reduced since the association between query-url pairs has 
been removed (the published data in Gotz et al. [10] also suffers 
from this constraint). With the released data, we cannot develop 
personalized query suggestion or recommendation for search engines, 
nor can we carry out human behavior research, since the output 
data does not record that any two queries belong to 
the same user. Second, as addressed by Gotz et al. [10], the relaxed 
differential privacy notion in [19] is not sufficiently strong. Third, 
the utility in [19] is merely evaluated but not shown to be 
maximized. Adding Laplacian noise to the counts of selected queries 
and urls is straightforward, but it does not allow us to directly model 
optimization problems that maximize the output utility. In contrast, our 
paper seeks the maximum output utility for a novel differentially 
private search log sanitization mechanism that generates outputs 
with the same schema as the original search log. 

Furthermore, Gotz et al. [10] analyze algorithms for publishing 
frequent keywords, queries and clicks in search logs and conduct 
a comparison w.r.t. two relaxations of $\epsilon$-differential privacy 
(relaxations are indispensable in search log publishing). Our work 
utilizes the stronger relaxation of $\epsilon$-differential privacy: 
probabilistic differential privacy. Since we explore the optimal utility 
in our differentially private sanitization mechanism, which outputs 
search logs rather than the results of counting queries and clicked 
urls over the search log, our work has a completely different focus 
compared with their work [10]. 

2.2 Differential Privacy 

In the context of relational data anonymization, Dwork et al. [6, 
7] have proposed the rigorous privacy definition of differential 
privacy: a randomized algorithm is differentially private if, for any pair 
of neighboring inputs, the probabilities of generating the same 
output are within a small multiplicative factor of each other. This 
means that for any two datasets which are close to one another, a 
differentially private algorithm will behave approximately the same on 
both data sets. 
This notion provides sufficient privacy protection for users 
regardless of the prior knowledge possessed by the adversaries. It has 
been extended to data release in various contexts besides 
search logs (e.g., contingency tables, graph data). Specifically, Xiao 
et al. [27] introduced a data publishing technique which ensures 
$\epsilon$-differential privacy while providing accurate answers for 
range-count queries. Hay et al. [13] presented an efficient algorithm for 
releasing a provably private estimate of the degree distribution of 
a network that also satisfies differential privacy. 
McSherry et al. [25] solved the problem of producing recommendations 
from collective user behavior while providing differential privacy 
for users. Our work follows the same line of research. 

2.3 Tradeoff between Privacy and Utility 

For any data modification based anonymization technique, a 
tradeoff between privacy and utility naturally holds. Li et al. [22] 
analyzed the fundamental characteristics of privacy and utility, and 
proposed a tradeoff framework for discussing privacy and utility. 
In microdata disclosure, Bayardo et al. [4] and LeFevre et al. [21] 
raised the optimal k-anonymity and the optimal multidimensional 
anonymization problem respectively. Kifer et al. [18] presented a 
way to gain additional utility from k-anonymous and l-diverse 
tables. Recently, Ghosh et al. [8] introduced a utility-maximizing 
mechanism for releasing a statistical database. However, there is 
little work on this topic in the context of differential privacy 
guaranteed search log release. To our knowledge, we take a first step 
towards addressing this deficiency. 

3. MODEL 

3.1 Differential Privacy 

Our objective is to privately sanitize an input search log that 
includes pseudonymous user-IDs, search queries, clicked urls and 
the counts of every user's click-through query-url pairs. In doing so, 
we ensure that the output has the same schema as the input: 
every single tuple in the output includes a pseudonymous user-ID, 
a click-through query-url pair and its count for this user. 

We consider two search logs to be neighbors if they differ by 
all the query tuples of one arbitrary user. We therefore define the set of 
all query tuples of a user in a search log D as that user's user log. 

Definition 1. (User Log $A_k$) Given a search log $D$, we 
denote each user $s_k$'s user log $A_k$ as all his/her query tuples in 
$D$, where every single tuple $[s_k, q_i, u_j, c_{ijk}] \in A_k$ includes a 
pseudonymous user-ID ($s_k$), a query ($q_i$), a url ($u_j$) and the count 
($c_{ijk}$) of query-url pair $(q_i, u_j)$ belonging to user $s_k$. 
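
As a purely illustrative sketch (the tuple layout, the toy data and the helper names `user_logs` and `neighbor_without` are assumptions made here, not part of the paper), a search log can be held as a list of (user-ID, query, url, count) tuples, from which a user log $A_k$ and a neighboring search log $D' = D - A_k$ can be derived:

```python
from collections import defaultdict

# Hypothetical toy search log D: (user_id, query, url, count) tuples.
D = [
    ("081", "book", "amazon.com", 3),
    ("081", "google", "google.com", 15),
    ("082", "car price", "kbb.com", 2),
    ("082", "google", "google.com", 7),
    ("083", "google", "google.com", 17),
    ("083", "car price", "kbb.com", 5),
]

def user_logs(log):
    """Group the tuples of a search log by user-ID (one user log A_k per user)."""
    logs = defaultdict(list)
    for tup in log:
        logs[tup[0]].append(tup)
    return dict(logs)

def neighbor_without(log, user_id):
    """Return the neighboring log D' = D - A_k obtained by dropping one user log."""
    return [tup for tup in log if tup[0] != user_id]

A = user_logs(D)                      # A["082"] is user 082's user log
D_prime = neighbor_without(D, "082")  # D and D_prime are neighbors
```

Here D and D_prime differ in exactly one user log, which is the neighboring relation used throughout this section.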

Clearly, every search log $D$ consists of numerous individual user 
logs ($D = \bigcup_{\forall s_k \in D} A_k$). Given two neighboring input search logs 
$D$ and $D'$ (w.l.o.g., $D = D' + A_k$), ensuring $\epsilon$-differential privacy 
for all the outputs might be impossible: for any output $O$ including 
items in $D$ but not in $D'$ (such as user-ID $s_k$), the probability of 
generating $O$ from $D'$ is zero but from $D$ is non-zero, hence the 
ratio between the probabilities cannot be bounded by $e^{\epsilon}$ (due to a 
zero denominator). We thus adopt the following relaxed notion of 
differential privacy (using our notation): 

Definition 2. ($(\epsilon, \delta)$-probabilistic differential privacy 
[24, 10]) A randomization algorithm $\mathcal{R}$ satisfies $(\epsilon, \delta)$-probabilistic 
differential privacy if for any input search log $D$, we can divide the 
output space $\Omega$ into two sets $\Omega_1, \Omega_2$ such that 

(1) $\Pr[\mathcal{R}(D) \in \Omega_1] \le \delta$, and 

for all of $D$'s neighboring search logs $D'$ and for any output $O \in \Omega_2$: 

(2) $\frac{\Pr[\mathcal{R}(D) = O]}{\Pr[\mathcal{R}(D') = O]} \le e^{\epsilon}$ and $\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]} \le e^{\epsilon}$. 


The above probabilistic differential privacy ensures that $\mathcal{R}$ 
satisfies $\epsilon$-differential privacy with high probability (no less than 
$1 - \delta$) [10]. In this definition, the set $\Omega_1$ includes all 
privacy-breaching outputs for $\epsilon$-differential privacy, where the probability 
of generating such outputs is bounded by $\delta$. Specifically, in our 
sanitization (w.l.o.g. $D = D' + A_k$), since we retain user-IDs in the 
output and $D'$ does not contain $s_k$, we can only consider $\Omega_1$ as 
the output space where all outputs in $\Omega_1$ include user-ID $s_k$ 
(because $\epsilon$-differential privacy cannot be achieved when $D'$ and $D$ 
differ in user $s_k$'s user log $A_k$ and the output $O$ includes $s_k$). 
Hence, the probability $\Pr[\mathcal{R}(D) \in \Omega_1]$ should be no greater than 
$\delta$ (the probability of $s_k$ existing in the overall output space $\Omega$ should 
be bounded by $\delta$). Moreover, for any output $O \in \Omega_2$, the two ratios 
should be bounded by $e^{\epsilon}$ for achieving $\epsilon$-differential privacy. 
Definition 2 has been proven to be stronger than the privacy notion of 
Korolova et al.'s work [19] (indistinguishability differential privacy 
[6]) by Gotz et al. [10] (as also shown in Section 4.3). 

All the sanitization methods addressed in this paper are required 
to satisfy this robust and rigorous privacy definition. No matter 
how much prior knowledge is owned by adversaries, we can lower 
the privacy risk by bounding the probabilities that any two 
neighboring inputs produce any possible output. 

3.2 Search Log Sanitization Process 

With a rigorous privacy standard (Definition 2), our goal is to 
maximize the retained utility of the sanitized search logs. We now 
illustrate our search log sanitization process, which integrates the 
satisfaction of differential privacy and utility maximization. 

The most sensitive values in search logs are the click-through 
information. Sometimes search queries may be more sensitive than 
the clicked urls in search logs (e.g., query "diabetes medicine" and 
click "www.walmart.com"), or vice versa (e.g., query "medicine" 
and click "www.cancer.gov"). We thus consider each distinct 
click-through query-url pair (simply denoted as query-url pair) as a 
combination of the sensitive values in the search logs. In our privacy 
model, Definition 2 ensures that adding all of any user's search 
information (user-ID, query-url pairs and the counts) to the input does 
not cause any additional risk. 

Table 2 presents some frequently used notations in our model: 
we denote $c_{ij}$ as the input count of any query-url pair $(q_i, u_j)$, and 
the set of these counts $\{\forall c_{ij}\}$ constitutes the input query-url 
histogram. Similarly, $x_{ij}$ represents the output count of $(q_i, u_j)$, and 
the set of these counts $x = \{\forall x_{ij}\}$ forms the output query-url 
histogram. Finally, the output counts of all triplets $(q_i, u_j, s_k)$ form 
the output query-url-user histogram, which is randomly sampled 
(the sampling process will be given later on). Similarly, the 
deterministic counts of all triplets $(q_i, u_j, s_k)$ in the input form the input 
query-url-user histogram. 

Table 2: Frequently Used Notations 

Notation                      Meaning
$(q_i, u_j)$                  an arbitrary query-url pair in the input/output
$(q_i, u_j, s_k)$             any user $s_k$'s arbitrary query-url pair $(q_i, u_j)$
$c_{ij}$                      the total count of $(q_i, u_j)$ in the input
$c_{ijk}$                     the count of triplet $(q_i, u_j, s_k)$ in the input
$x_{ij}$ (variable)           the total count of $(q_i, u_j)$ in the output
                              (in the optimal solution: $x^*_{ij}$)
$x_{ijk}$ (random variable)   the count of triplet $(q_i, u_j, s_k)$ in a sample output
                              ($x^*_{ijk}$ is the count of $(q_i, u_j, s_k)$ if sampling
                              with $x^*_{ij}$ trials)


Algorithm 1 illustrates the two steps of our sanitization. We first 
compute the optimal output counts for all the query-url pairs in the 
input search log $D$, and then generate the output $O$ by sampling 
user-IDs for each of them with a multinomial distribution [2] (the 
details of this multinomial sampling are given later on). More 
specifically, the algorithm can be guaranteed to be differentially private by 
some constraints on the output counts of all query-url pairs $\{\forall x_{ij}\}$ 
(we can derive the constraints from the randomization, as shown 
in Section 4). Meanwhile, the output utility can be maximized by 
the utility objective function (some options are given in Section 5). 
Thus, we can formulate the utility-maximizing problem to 
compute the optimal output counts of all query-url pairs for the random 
sampling (the optimal solution $x^* = \{\forall x^*_{ij}\}$ achieves the optimal 
output utility and also satisfies the differential privacy constraints). 

Input Search Log 

User-ID ($s_k$)   Click-through query-url pair ($q_i, u_j$)   Count ($c_{ijk}$)
081               pregnancy test nyc, medicinenet.com         2
                  book, amazon.com                            3
                  google, google.com                          15
082               car price, kbb.com                          2
                  google, google.com                          7
083               google, google.com                          17
                  diabetes medicine, walmart.com              1
                  book, amazon.com                            1
                  car price, kbb.com                          5

Step 1: Compute the optimal output counts of all the query-url pairs 
(maximize the output utility with a defined utility measure for the 
output counts of all the query-url pairs; the sanitization/randomization 
algorithm is guaranteed to be differentially private by the constraints 
w.r.t. the output counts of all the query-url pairs). 

Step 2: Multinomial Sampling 

Click-through query-url pair ($q_i, u_j$)   Optimal output count   Sampled User-IDs (sampled times)
pregnancy test nyc, medicinenet.com         0                      (none)
book, amazon.com                            3                      3 -> 081 (2), 083 (1)
google, google.com                          20                     20 -> 081 (8), 082 (3), 083 (9)
diabetes medicine, walmart.com              0                      (none)
car price, kbb.com                          4                      4 -> 082 (1), 083 (3)

(a) Sanitization with Multinomial Sampling 

Output Search Log 

User-ID   Click-through query-url pair ($q_i, u_j$)   Count ($x_{ijk}$)
081       pregnancy test nyc, medicinenet.com         0
          book, amazon.com                            2
          google, google.com                          8
082       car price, kbb.com                          1
          google, google.com                          3
083       google, google.com                          9
          diabetes medicine, walmart.com              0
          book, amazon.com                            1
          car price, kbb.com                          3

(b) A Sample Output 

Figure 1: An Example of the Sanitization Algorithm 



Algorithm 1 Sanitization Algorithm 

Input: search log $D$ and differential privacy parameters $(\epsilon, \delta)$ 
Output: sanitized search log $O$ 

1: Compute the optimal output counts for all query-url pairs in the 
   search log: $\{\forall (q_i, u_j) \in D, x^*_{ij}\}$. 
   /*** solve an optimization problem: define a utility objective function 
   w.r.t. the output counts $\{\forall x_{ij}\}$, with $\{\forall x_{ij}\}$ subject to 
   constraints that ensure differential privacy for this algorithm (the 
   optimal solution is $\{\forall x^*_{ij}\}$) ***/ 
2: Generate the output $O$: sample user-IDs for every query-url pair 
   $(q_i, u_j)$ with $x^*_{ij}$ multinomial trials (the probability of every 
   sampled outcome in one trial is given by the input $D$). 
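
Step 2 of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the dictionary layout and the function name `sample_output` are assumptions, and the output counts are simply taken as given (Step 1, the optimization, is not shown). Drawing $x^*_{ij}$ independent weighted samples with the standard-library `random.choices` is one way to realize the multinomial trials.

```python
import random
from collections import Counter

def sample_output(c, x_opt, rng=None):
    """Sample user-IDs for each query-url pair (Step 2 of Algorithm 1).

    c     : {pair: {user_id: c_ijk}}  input query-url-user histogram
    x_opt : {pair: x_ij}              output counts, assumed already computed
    Returns {pair: {user_id: x_ijk}} for one random output.
    """
    rng = rng or random.Random()
    out = {}
    for pair, x_ij in x_opt.items():
        if x_ij == 0:
            continue  # pairs with output count 0 are not released
        users = list(c[pair])
        weights = [c[pair][u] for u in users]  # proportional to c_ijk / c_ij
        draws = rng.choices(users, weights=weights, k=x_ij)  # x_ij independent trials
        out[pair] = dict(Counter(draws))
    return out

# Input counts and output counts taken from the example of Figure 1.
c = {
    ("pregnancy test nyc", "medicinenet.com"): {"081": 2},
    ("book", "amazon.com"): {"081": 3, "083": 1},
    ("google", "google.com"): {"081": 15, "082": 7, "083": 17},
    ("diabetes medicine", "walmart.com"): {"083": 1},
    ("car price", "kbb.com"): {"082": 2, "083": 5},
}
x_opt = {
    ("pregnancy test nyc", "medicinenet.com"): 0,
    ("book", "amazon.com"): 3,
    ("google", "google.com"): 20,
    ("diabetes medicine", "walmart.com"): 0,
    ("car price", "kbb.com"): 4,
}
O = sample_output(c, x_opt, random.Random(0))
```

Each run yields a different sample output; the per-pair totals always equal the given output counts, and a user who never issued a pair (such as 081 for "car price, kbb.com") can never be sampled for it.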



Figure 1 shows an example of Algorithm 1, particularly the 
multinomial sampling after computing the optimal output counts of all 
query-url pairs $\{\forall x^*_{ij}\}$ (assume that {0, 3, 20, 0, 4} in the example 
is the optimal solution of an optimization problem that includes a 
utility objective and some constraints ensuring differential privacy). 
Our multinomial sampling has the following properties: 

1. The number of multinomial trials for $(q_i, u_j)$'s user-ID 
sampling is given as $x^*_{ij}$ (optimal solution $x^* = \{\forall x^*_{ij}\}$). 

2. In every multinomial trial for any query-url pair $(q_i, u_j)$, 
the probability that any user-ID $s_k$ is sampled is $c_{ijk}/c_{ij}$. 
For example, for "car price, kbb.com" in Figure 1, the 
probability that user 082 is sampled is $\frac{2}{0+2+5}$, whereas the 
probability that user 081 is sampled for this query-url pair is 0. In 
addition, the expected value of every random variable $x_{ijk}$ 
can be derived as $E(x_{ijk}) = x_{ij} \cdot \frac{c_{ijk}}{c_{ij}}$. Thus, given an 
output count $x^*_{ij}$ (optimal) for any query-url pair $(q_i, u_j)$, 
the shapes of the input and output query-url-user histograms w.r.t. 
only query-url pair $(q_i, u_j)$ (illustrating the individual counts 
of $(q_i, u_j)$ held by distinct users) should be analogous (this is 
guaranteed by the multinomial distribution). For example, for the 
input/output query-url-user histogram w.r.t. "google, google.com", even 
though the output count $x^*_{ij} = 20 < c_{ij} = 15 + 7 + 17 = 39$, the 
shapes of the histograms {8, 3, 9} (in a randomized output, see 
Figure 1(b)) and {15, 7, 17} (in the input) are similar. 

3. If, $\forall (q_i, u_j)$, the Input Support (denoted as $c_{ij} / \sum_{\forall (q_i, u_j)} c_{ij}$) 
is close to the Output Support (denoted as $x_{ij} / \sum_{\forall (q_i, u_j)} x_{ij}$), 
the shape of the output query-url histogram can be 
maximally preserved. In that case, after sampling user-IDs with 
the above output counts of all query-url pairs (i.e., the 
output query-url histogram), the shape of the output 
query-url-user histogram can be maximally preserved as well. 

Actually, one of our utility-maximizing problems is to seek 
the optimal output utility that minimizes the sum of the 
support distances for all frequent query-url pairs (see the 
definition and details in Section 5; if pursuing the minimum 
sum of support distances for all query-url pairs, we can lower 
the minimum support threshold). Thus, once the sum of the 
support distances is minimized (the utility-maximizing 
problem can do so, e.g., it determines that the distance between 
$\{\forall x^*_{ij} / \sum x^*_{ij}\} = \{0, \frac{3}{27}, \frac{20}{27}, 0, \frac{4}{27}\}$ and 
$\{\forall c_{ij} / \sum c_{ij}\} = \{\frac{2}{53}, \frac{3+1}{53}, \frac{15+7+17}{53}, \frac{1}{53}, \frac{0+2+5}{53}\}$ 
is minimized while satisfying some privacy guarantee constraints), the 
shapes of the input and output query-url-user histograms can be 
analogous (e.g., see the counts in the left table of Figure 1(a) and 
in Figure 1(b)). 
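
To make the support comparison concrete, the following sketch computes the input and output supports for the Figure 1 example with exact fractions. The function names are hypothetical, and the sum of absolute differences used as the distance is only an assumption for illustration; the paper's own definition of support distance is given in Section 5.

```python
from fractions import Fraction

def supports(counts):
    """Map each query-url pair to its support: count / total count."""
    total = sum(counts.values())
    return {pair: Fraction(n, total) for pair, n in counts.items()}

def sum_support_distance(c_in, x_out):
    """Sum over pairs of |input support - output support| (assumed distance)."""
    s_in, s_out = supports(c_in), supports(x_out)
    return sum(abs(s_in[p] - s_out[p]) for p in c_in)

# Total input counts c_ij and example output counts x_ij from Figure 1.
c_in = {"pregnancy test nyc": 2, "book": 3 + 1, "google": 15 + 7 + 17,
        "diabetes medicine": 1, "car price": 2 + 5}
x_out = {"pregnancy test nyc": 0, "book": 3, "google": 20,
         "diabetes medicine": 0, "car price": 4}

input_support = supports(c_in)    # "google" has support 39/53
output_support = supports(x_out)  # "google" has support 20/27
distance = sum_support_distance(c_in, x_out)
```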

To sum up, if we compute the output count of every query-url 
pair $x = \{\forall x_{ij}\}$ by solving an optimization problem (for 
variables $x = \{\forall x_{ij}\}$) that maximizes the output utility and also 
ensures differential privacy for the sanitization algorithm, the output 
with optimal utility can be generated by sampling user-IDs for all 
the query-url pairs (the schema of the input and output is indeed identical, 
since we can sort the output by the sampled user-IDs, as shown in 
Figure 1(b), where the association between query-url pairs and the 
shape of the query-url-user histogram can be preserved). 

4. PRIVACY GUARANTEE CONDITIONS 

Assume that $\mathcal{R}$ is a sanitization algorithm that samples user-IDs 
for every query-url pair $(q_i, u_j)$ with its total output count $x_{ij}$. 
Since the sampling procedures for all query-url pairs are 
independent, for any input $D$ ($\{\forall c_{ijk}\}$ is given) and a possible output $O$ 
($\{\forall x_{ijk}\}$ is also given), the probability $\Pr[\mathcal{R}(D) = O]$ can be 
computed in terms of the probability mass function of the multinomial 
distribution [2]: 

$$\Pr[\mathcal{R}(D) = O] = \prod_{\forall (q_i, u_j) \in D} \left( \frac{x_{ij}!}{\prod_{\forall s_k \in D} x_{ijk}!} \prod_{\forall s_k \in D} \left( \frac{c_{ijk}}{c_{ij}} \right)^{x_{ijk}} \right) \qquad (1)$$

Indeed, $\Pr[\mathcal{R}(D) = O]$ is determined by $x_{ij}$ and $\{\forall s_k \in D, 
\frac{c_{ijk}}{c_{ij}}$ and $x_{ijk}\}$. Given input $D$, $\{\forall s_k, \frac{c_{ijk}}{c_{ij}}\}$ are constants. 
Hence, if $\forall (q_i, u_j) \in D$ the output count $x_{ij}$ is determined, we 
can compute the probability $\Pr[\mathcal{R}(D) = O]$ for any output $O \in \Omega$ 
($\forall x_{ijk}$ are fixed in $O$). Therefore, given any pair of neighboring 
inputs $D$ and $D'$ that differ in one user log, bounding the probabilities 
per Definition 2 for a divided output space $\Omega$ can be transformed 
into the problem of determining a feasible solution $x = \{\forall x_{ij}\}$ in the 
output that satisfies all the probability bounding conditions in 
Definition 2 for an output space split $\Omega = \Omega_1 \cup \Omega_2$. Using this, we 
can formulate the constraints (satisfying differential privacy) for 
the variables: the counts of all query-url pairs $x = \{\forall x_{ij}\}$ in all the 
possible outputs $O \in \Omega$. 
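
The probability of one particular output can be evaluated directly. The sketch below is illustrative only: it assumes the standard multinomial probability mass function including its multinomial coefficient, treats query-url pairs as independent, and uses a hypothetical function name and dictionary layout.

```python
from math import factorial, prod

def pr_output(c, x):
    """Pr[R(D) = O] as a product of independent multinomial pmfs, one per pair.

    c : {pair: {user_id: c_ijk}}   input counts
    x : {pair: {user_id: x_ijk}}   output counts of one candidate output O
    """
    p = 1.0
    for pair, out_counts in x.items():
        c_ij = sum(c[pair].values())
        x_ij = sum(out_counts.values())
        # Multinomial coefficient x_ij! / prod_k x_ijk! (exact integer arithmetic).
        coeff = factorial(x_ij)
        for x_ijk in out_counts.values():
            coeff //= factorial(x_ijk)
        # Probability term prod_k (c_ijk / c_ij) ** x_ijk; unknown users get c_ijk = 0.
        p *= coeff * prod((c[pair].get(u, 0) / c_ij) ** n
                          for u, n in out_counts.items())
    return p

# "car price, kbb.com" from Figure 1: input counts {082: 2, 083: 5},
# sampled output counts {082: 1, 083: 3}.
c = {"car price": {"082": 2, "083": 5}}
x = {"car price": {"082": 1, "083": 3}}
p = pr_output(c, x)  # equals 4 * (2/7) * (5/7)**3
```

An output that contains a user-ID absent from the input for that pair has probability zero, which is exactly the situation that forces the relaxation to probabilistic differential privacy.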

Without loss of generality, we let $D = D' + A_k$, where $D$ and 
$D'$ differ in an arbitrary user $s_k$'s user log $A_k$. We first derive 
the probabilities in Definition 2 for all $O$ in the output space $\Omega$, and 
then deduce the constraints for satisfying differential privacy. 

4.1 Probabilities in Definition 2 

Since $D = D' + A_k$, the user-ID $s_k$ might be sampled into the 
output $O$ when starting from $D$. Thus, for all outputs $O$ which 
contain $s_k$, we have $\Pr[\mathcal{R}(D') = O] = 0$ (since $s_k \notin D'$). Recall 
that, given $A_k = D - D'$ (or $A_k = D' - D$), we can only divide 
the output space $\Omega$ into two sets $\Omega_1$ and $\Omega_2$ as follows: (1) every output 
$O$ in $\Omega_1$ includes $s_k$; (2) every output $O$ in $\Omega_2$ does not include 
$s_k$, because $\Omega_1$ should include all the exceptional outputs that 
violate $\epsilon$-differential privacy. We thus bound the probabilities per 
Definition 2 for the above output space split of any two 
neighboring inputs ($\forall O \in \Omega_1$, user-ID $s_k \in O$, and $\forall O \in \Omega_2$, $s_k \notin O$) to achieve 
differential privacy. 

4.1.1 $\forall O \in \Omega_1$ 

Since $s_k \in O$ for every $O \in \Omega_1$, we have $\Pr[\mathcal{R}(D') = O] = 0$. 
Thus, the probability $\Pr[\mathcal{R}(D') \in \Omega_1]$ is also equal to 0. We now 
compute the probability $\Pr[\mathcal{R}(D) \in \Omega_1]$. 

Specifically, the probability of generating from $D$ some output that 
includes user-ID $s_k$, i.e., $\Pr[\mathcal{R}(D) \in \Omega_1]$, 
is equal to the probability that "$s_k$ is sampled at least once in the 
multinomial sampling process of all the query-url pairs in $A_k$". For 
every query-url pair $(q_i, u_j) \in A_k$, if its total output count in the 
sampling is $x_{ij}$, the probability that $s_k$ is not sampled in a single 
multinomial trial (a user-ID in $D$ other than $s_k$ is sampled) is 
$\frac{c_{ij} - c_{ijk}}{c_{ij}}$, 
simply because user $s_k$ holds $(q_i, u_j)$ with the count $c_{ijk}$ and the 
total count of $(q_i, u_j)$ is $c_{ij}$ in the input $D$. Since every $(q_i, u_j) \in A_k$ 
may lead to $s_k$ being sampled, and the multinomial sampling 
for every query-url pair $(q_i, u_j)$ includes $x_{ij}$ independent trials, 
we have $\Pr[s_k \text{ is not sampled}] = \prod_{\forall (q_i, u_j) \in A_k} \left( \frac{c_{ij} - c_{ijk}}{c_{ij}} \right)^{x_{ij}}$. 
Finally, we can obtain the probability that $s_k$ is sampled at least once: 
$\Pr[s_k \text{ is sampled}] = 1 - \prod_{\forall (q_i, u_j) \in A_k} \left( \frac{c_{ij} - c_{ijk}}{c_{ij}} \right)^{x_{ij}}$. 

Thus, we can derive the probability $\Pr[\mathcal{R}(D) \in \Omega_1]$ as below: 

$$\Pr[\mathcal{R}(D) \in \Omega_1] = 1 - \prod_{\forall (q_i, u_j) \in A_k} \left( \frac{c_{ij} - c_{ijk}}{c_{ij}} \right)^{x_{ij}} \qquad (2)$$



One important issue is worth noting in multinomial sampling. 
For any query-url pair $(q_i, u_j) \in A_k$ where $c_{ijk} = c_{ij}$ ($(q_i, u_j)$ is 
unique and only belongs to user $s_k$), if its output count $x_{ij} > 0$, the 
probability $\Pr[\mathcal{R}(D) \in \Omega_1]$ is equal to 1, which cannot be 
bounded. Therefore, we let $x_{ij} = 0$ for this case, and all the unique 
query-url pairs in the input should be removed. 
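
The probability that a given user is sampled at least once can be computed as in the following sketch (the function name and dictionary layout are assumptions for illustration). It also shows why a unique query-url pair with a positive output count drives this probability to 1.

```python
def pr_user_sampled(c, x, user):
    """Pr[R(D) in Omega_1]: probability that `user` is sampled at least once.

    c : {pair: {user_id: c_ijk}}  input counts
    x : {pair: x_ij}              output counts of the query-url pairs
    """
    not_sampled = 1.0
    for pair, x_ij in x.items():
        c_ijk = c[pair].get(user, 0)
        if c_ijk == 0:
            continue  # this pair is not in the user's log A_k
        c_ij = sum(c[pair].values())
        # x_ij independent trials, each missing the user with prob (c_ij - c_ijk) / c_ij
        not_sampled *= ((c_ij - c_ijk) / c_ij) ** x_ij
    return 1.0 - not_sampled

# User 082 in the Figure 1 example holds two of the released pairs.
c = {"car price": {"082": 2, "083": 5},
     "google": {"081": 15, "082": 7, "083": 17}}
x = {"car price": 4, "google": 20}
p_082 = pr_user_sampled(c, x, "082")  # 1 - (5/7)**4 * (32/39)**20

# A unique pair (held by one user only) with positive output count gives 1.
p_unique = pr_user_sampled({"p": {"u": 1}}, {"p": 1}, "u")
```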

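4.1.2 $\forall O \in \Omega_2$ 

For any output $O \in \Omega_2$, we discuss the ratios $\frac{\Pr[\mathcal{R}(D) = O]}{\Pr[\mathcal{R}(D') = O]}$ and 
$\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]}$ (since $O$ does not include $s_k$, we have 
$\Pr[\mathcal{R}(D) = O] > 0$ and $\Pr[\mathcal{R}(D') = O] > 0$). 

Intuitively, for all query-url pairs that belong to both $A_k$ and 
$D'$, sampling user-IDs from $D$ involves an additional candidate $s_k$ 
(but $s_k \notin O$) compared with sampling user-IDs from $D'$. We thus 
have $\frac{\Pr[\mathcal{R}(D) = O]}{\Pr[\mathcal{R}(D') = O]} \le 1$ and $\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]} \ge 1$. Since the ratio 
$\frac{\Pr[\mathcal{R}(D) = O]}{\Pr[\mathcal{R}(D') = O]}$ is bounded by 1 (and obviously by $e^{\epsilon}$), we only need to 
derive the ratio $\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]}$. 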

As mentioned in Section 4.1.1, all the query-url pairs in $D$ (and 
$A_k$) but not in $D'$ should not be retained in the output. Thus, 
to generate $O$ from $D$, we only sample user-IDs for the common 
query-url pairs of $D$ and $D'$. Two categories of common query-url 
pairs can be identified in $D'$ ($D' \subseteq D$ here): (1) query-url pairs in 
$D'$ but not in $A_k$; (2) query-url pairs in $D'$ and also in $A_k$. 

In the first category, $\forall (q_i, u_j)$ in $D'$ but not in $A_k$, the 
probabilities of sampling user-IDs for $(q_i, u_j)$ from $D$ and $D'$ are equal, 
because the query-url-user histogram w.r.t. these query-url pairs in 
$D$ and $D'$ is identical. We denote the ratio of these two 
probabilities as $\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]}(ij)$, which is equal to 1. 

In the second category, $\forall (q_i, u_j)$ in $D'$ and also in $A_k$, we can 
divide every sampled user-ID in the process of $\mathcal{R}(D) \to O$ into 
two cases: "$s_k$ is sampled or not". In every multinomial trial for 
$(q_i, u_j)$, the probability of sampling $s_k$ is $\frac{c_{ijk}}{c_{ij}}$, while the 
probability of sampling another user-ID in $D$ (also in $D'$) is $1 - \frac{c_{ijk}}{c_{ij}}$. 
Since the number of $(q_i, u_j)$ in the output is $x_{ij}$ ($x_{ij}$ 
independent trials), we have the ratio 

$$\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]}(ij) = \left( \frac{c_{ij}}{c_{ij} - c_{ijk}} \right)^{x_{ij}}$$

(since $O$ does not contain $s_k$, $s_k$ should not be sampled 
in the $x_{ij}$ independent trials when generating $O$ from $D$). 

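In sum, to generate any output $O \in \Omega_2$ from $D$ and $D'$ respectively,
user-IDs are sampled independently for the two categories of
query-url pairs above. Thus, $\forall O \in \Omega_2$,

$$\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]} = \prod_{\forall (q_i, u_j) \in D'} \frac{\Pr[\mathcal{R}(D') = O]_{(ij)}}{\Pr[\mathcal{R}(D) = O]_{(ij)}}.$$

Since $\frac{\Pr[\mathcal{R}(D') = O]_{(ij)}}{\Pr[\mathcal{R}(D) = O]_{(ij)}} = 1$ for every $(q_i, u_j) \in D'$ but $\notin A_k$, we have $\forall O \in \Omega_2$:

$$\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]} = \prod_{\forall (q_i, u_j) \in D' \cap A_k} \left(\frac{c_{ij}}{c_{ij} - c_{ijk}}\right)^{x_{ij}} \tag{3}$$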
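The closed form for the probability ratio can be checked numerically for a single query-url pair. The sketch below is only an illustration: the toy counts and the helper `multinomial_pmf` are hypothetical, not taken from the paper. It compares the ratio of the two multinomial probabilities with $(c_{ij} / (c_{ij} - c_{ijk}))^{x_{ij}}$.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing exactly `counts` in sum(counts) independent trials."""
    coeff = factorial(sum(counts))
    for k in counts:
        coeff //= factorial(k)
    return coeff * prod(p ** k for p, k in zip(probs, counts))


# Hypothetical histogram of one query-url pair (q_i, u_j):
# user s_k holds c_ijk = 2 of the c_ij = 10 occurrences in D.
c_other = [5, 3]            # counts of the users that appear in both D and D'
c_ijk = 2
c_ij = sum(c_other) + c_ijk

# One output in Omega_2: x_ij = 3 sampled user-IDs, none of them s_k.
out_other = [2, 1]
x_ij = sum(out_other)

p_D = multinomial_pmf(out_other + [0], [c / c_ij for c in c_other + [c_ijk]])
p_D_prime = multinomial_pmf(out_other, [c / (c_ij - c_ijk) for c in c_other])

ratio = p_D_prime / p_D
closed_form = (c_ij / (c_ij - c_ijk)) ** x_ij
print(ratio, closed_form)
```

The multinomial coefficients cancel in the ratio, which is why only the per-trial factor $c_{ij} / (c_{ij} - c_{ijk})$ remains.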



4.2 Differential Privacy Constraints 

$(\epsilon, \delta)$-probabilistic differential privacy (Definition 2) demands:
for any input $D$, $\Pr[\mathcal{R}(D) \in \Omega_1] \le \delta$; for $D$'s arbitrary neighboring
input $D'$ and $\forall O \in \Omega_2$, $\frac{1}{e^{\epsilon}} \le \frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]} \le e^{\epsilon}$. We
now show that proving the randomization algorithm to be
$(\epsilon, \delta)$-probabilistic differentially private as per Definition 2 is equivalent
to ensuring that the output counts of all query-url pairs satisfy a set
of conditions. Theorem 1 is proven in Appendix A.

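THEOREM 1. The randomization algorithm $\mathcal{R}$ achieves $(\epsilon, \delta)$-probabilistic
differential privacy if for any input search log $D$, the
output counts of query-url pairs $x = \{\forall (q_i, u_j) \in D, x_{ij}\}$ satisfy:

1. if $\exists$ triplet $(q_i, u_j, s_k) \in D$ such that $c_{ijk} = c_{ij}$, then $x_{ij} = 0$
(do not output unique query-url pairs);

2. for all $A_k \subseteq D$: $\prod_{\forall (q_i, u_j) \in A_k} \left(\frac{c_{ij}}{c_{ij} - c_{ijk}}\right)^{x_{ij}} \le e^{\epsilon}$;

3. for all $A_k \subseteq D$: $1 - \prod_{\forall (q_i, u_j) \in A_k} \left(\frac{c_{ij} - c_{ijk}}{c_{ij}}\right)^{x_{ij}} \le \delta$.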
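As an illustration, the three conditions can be checked directly for a candidate vector of output counts. The following is a minimal sketch under assumptions of ours: the nested-dictionary layout `c[pair][user]` for the query-url-user histogram and the function name `satisfies_theorem1` are hypothetical, and all stored counts are assumed to be positive.

```python
from math import exp


def satisfies_theorem1(c, x, eps, delta):
    """c[pair][user] = c_ijk (positive); x[pair] = x_ij. True if Conditions 1-3 hold."""
    totals = {pair: sum(users.values()) for pair, users in c.items()}

    # Condition 1: unique query-url pairs (c_ijk == c_ij) must not be output.
    for pair, users in c.items():
        unique = any(cnt == totals[pair] for cnt in users.values())
        if unique and x.get(pair, 0) != 0:
            return False

    user_ids = {user for users in c.values() for user in users}
    for user in user_ids:
        # Product over the pairs of this user's log A_k with a positive output count.
        ratio = 1.0
        for pair, users in c.items():
            if user in users and x.get(pair, 0) > 0:
                ratio *= (totals[pair] / (totals[pair] - users[user])) ** x[pair]
        if ratio > exp(eps):          # Condition 2
            return False
        if 1.0 - 1.0 / ratio > delta:  # Condition 3
            return False
    return True


# Hypothetical toy histogram with three users and three query-url pairs.
toy = {
    ("q1", "u1"): {"s1": 2, "s2": 8},
    ("q2", "u2"): {"s1": 1, "s2": 1, "s3": 2},
    ("q3", "u3"): {"s3": 4},          # unique pair, held by s3 only
}
print(satisfies_theorem1(toy, {("q2", "u2"): 1}, eps=1.0, delta=0.6))
```

Pairs with $x_{ij} = 0$ contribute a factor of 1 to both products, so skipping them also avoids a division by zero for unique pairs.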

As a result, we can utilize these conditions to formulate utility- 
maximizing problems in our differentially private search log san- 
itization. Specifically, we can implement Condition 1 while pre- 
processing the input search log (removing all the unique query-url 



pairs), and regard Conditions 2 and 3 as Differential Privacy Constraints
in the sanitization. As soon as they are satisfied, the sanitization
is $(\epsilon, \delta)$-probabilistic differentially private for every
pair of neighboring search logs that differ in only one user log.

Note that while our multinomial sampling process is differentially
private, the computation of the counts ($x^* = \{\forall x^*_{ij}\}$) is not
necessarily so. To make the whole (end-to-end) sanitization differentially
private, we must ensure that the count computation step is
also differentially private. One simple way to do this is to use the
generic procedure of adding Laplacian noise to the counts derived
from the optimization model ($x^* = \{\forall x^*_{ij}\}$). Since the count
computation can be viewed as a query over the input database, adding
Laplacian noise will make the computation differentially private.

Specifically, similar to Korolova et al. [19], if the count differences
of every query-url pair $(q_i, u_j)$ in the optimal solutions
derived from two neighboring inputs $(D, D')$ are bounded by a
constant $d$, computing optimal counts can be guaranteed to be
$\epsilon'$-differentially private [19] ($\epsilon'$ is the privacy parameter
for this step) by adding Laplacian noise to the optimal
count of every query-url pair: $\forall (q_i, u_j), x^*_{ij} \leftarrow x^*_{ij} + Lap(d/\epsilon')$.
Essentially, given $d$, we can simply bound the difference of every
query-url pair's optimal count (computed from any two neighboring
inputs $D, D'$) by executing the following preprocessing procedure
for every user log $A_k$ in the input database ($D$ or $D'$):

1. formulate two utility-maximizing problems (pick the same
option as the following sanitization) with neighboring inputs
$D$ and $D - A_k$ (or $D'$ and $D' - A_k$ if $D'$ is the input),
respectively, and solve them.

2. if the count difference of any query-url pair between the two optimal
solutions is greater than $d$, remove $A_k$ from $D$ (or $D'$).¹

If we apply the above preprocessing procedure to any two neighboring
inputs $D$ and $D'$, and compute the optimal output counts
with the updated $D$ and $D'$, the difference of every query-url pair's
optimal count is bounded by $d$. Thus, adding noise $Lap(d/\epsilon')$
can ensure $\epsilon'$-differential privacy [19] for the step of computing
optimal counts in Algorithm 1. While adding noise may distort the
optimality to some extent, this is the price of guaranteeing complete
differential privacy. Since adding Laplacian noise is a well-studied
generic approach, we do not discuss this differential privacy guarantee
further due to space limitations, and the sanitization/randomization
algorithm refers to the sampling process in this paper.
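The noise-adding step $x^*_{ij} \leftarrow x^*_{ij} + Lap(d/\epsilon')$ can be sketched with the standard library alone. This is an illustrative sketch, not the paper's implementation: the function names are ours, and the Laplace sample is drawn as the difference of two independent exponential variables with mean $d/\epsilon'$, which is a standard way to generate a zero-mean Laplace variable with that scale.

```python
import random


def laplace(scale, rng):
    """Zero-mean Laplace sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_laplace_noise(counts, d, eps_prime, rng=None):
    """Return x*_ij + Lap(d / eps') for every query-url pair in `counts`."""
    rng = rng or random.Random()
    scale = d / eps_prime
    return {pair: x + laplace(scale, rng) for pair, x in counts.items()}


# Hypothetical optimal counts for two query-url pairs.
optimal = {("q1", "u1"): 12.0, ("q2", "u2"): 3.0}
noisy = add_laplace_noise(optimal, d=2, eps_prime=0.5, rng=random.Random(7))
print(noisy)
```

The noisy counts are real-valued and may be negative; how they are rounded or clamped before sampling is not specified here, so any such post-processing would be a further assumption.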

4.3 Indistinguishability Differential Privacy 

Recall that we have noted earlier that probabilistic differential
privacy [24, 10] provides a stronger privacy guarantee than
indistinguishability differential privacy [6, 19]. In particular, the
probabilistic differential privacy notion has the following property:

PROPOSITION 1. Probabilistic differential privacy implies
indistinguishability differential privacy in our search log sanitization:
if all the conditions in Definition 2 are satisfied with parameters
$(\epsilon, \delta)$, the following two inequalities also hold:

1. $\Pr[\mathcal{R}(D') \in \hat{O}] \le e^{\epsilon} \cdot \Pr[\mathcal{R}(D) \in \hat{O}] + \delta$;

2. $\Pr[\mathcal{R}(D) \in \hat{O}] \le e^{\epsilon} \cdot \Pr[\mathcal{R}(D') \in \hat{O}] + \delta$.

where $\hat{O}$ is an arbitrary set of possible outputs and $\hat{O} \subseteq \Omega$.

¹ The optimization problems resulting from any two neighboring inputs
(especially large neighboring inputs) generate similar optimal
solutions. Thus, if $d$ is not too small, the output count difference
can be bounded by $d$. Otherwise, if $d$ is required to be sufficiently
small (for reducing sensitivity/noise), we remove some user
logs (those that cause large differences between the two optimal solutions). This
allows us to trade off utility for end-to-end differential privacy.

Gotz et al. prove Proposition 1 and show that its converse
does not hold in [10] (the proof of Proposition 1 is also given
in Appendix B). Hence, satisfying Definition 2 with the differential
privacy constraints (Theorem 1) provides a more rigorous privacy
guarantee than the work of Korolova et al. [19].

5. UTILITY-MAXIMIZING PROBLEMS 

While search logs consist of millions of queries and click-through
urls, not all of them are equally valuable from the perspective of utility.
Indeed, from an application perspective, only a small portion may be
useful with regard to a specific purpose. For instance, only the
frequent query-url pairs are useful for query recommendation. Hence,
different data usage purposes may result in different requirements 
for extracting data from the original search log. To privately sani- 
tize search logs while retaining maximal utility, we need to evaluate 
the data utility according to the usage requirement. In this section, 
we introduce three utility-maximizing problems with three differ- 
ent utility definitions. 

5.1 Maximizing the Output Size 

Before formulating the utility-maximizing problems, we first present
the differential privacy constraints. As stated in Theorem 1, our
sanitization algorithm satisfies $(\epsilon, \delta)$-probabilistic differential
privacy if three conditions for the output counts of all query-url pairs
are satisfied. Specifically, Condition 1 should be implemented in
the preprocessing step², while Conditions 2 and 3 give two sets of
constraints for the output counts of all query-url pairs, $x = \{x_{ij}\}$:

$$\text{s.t.} \begin{cases} \forall A_k \subseteq D, \ \prod_{\forall (q_i, u_j) \in A_k} \left(\frac{c_{ij}}{c_{ij} - c_{ijk}}\right)^{x_{ij}} \le e^{\epsilon} \\ \forall A_k \subseteq D, \ 1 - \prod_{\forall (q_i, u_j) \in A_k} \left(\frac{c_{ij} - c_{ijk}}{c_{ij}}\right)^{x_{ij}} \le \delta \\ \forall x_{ij} \ge 0 \text{ and } x_{ij} \text{ is an integer} \end{cases}$$

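Intuitively, the differential privacy constraints can be transformed
into linear constraints (with the constant $t_{ijk} = \frac{c_{ij}}{c_{ij} - c_{ijk}}$; each user log
$A_k$'s two constraints can be combined using the bound $\min\{\epsilon, \log \frac{1}{1 - \delta}\}$):

$$\text{s.t.} \begin{cases} \forall A_k \subseteq D, \ \sum_{\forall (q_i, u_j) \in A_k} x_{ij} \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\} \\ \forall x_{ij} \ge 0 \text{ and } x_{ij} \text{ is an integer} \end{cases} \tag{4}$$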
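For completeness, the step from the two product constraints to one linear constraint per user log can be spelled out by taking logarithms. The short derivation below uses only the symbols already defined ($t_{ijk} = c_{ij} / (c_{ij} - c_{ijk})$) and assumes $0 \le \delta < 1$ so that $\log(1 - \delta)$ is defined:

```latex
\begin{align*}
\prod_{(q_i,u_j)\in A_k} t_{ijk}^{\,x_{ij}} \le e^{\epsilon}
  &\iff \sum_{(q_i,u_j)\in A_k} x_{ij}\,\log t_{ijk} \le \epsilon, \\[4pt]
1-\prod_{(q_i,u_j)\in A_k} t_{ijk}^{-x_{ij}} \le \delta
  &\iff \prod_{(q_i,u_j)\in A_k} t_{ijk}^{-x_{ij}} \ge 1-\delta \\
  &\iff -\sum_{(q_i,u_j)\in A_k} x_{ij}\,\log t_{ijk} \ge \log(1-\delta) \\
  &\iff \sum_{(q_i,u_j)\in A_k} x_{ij}\,\log t_{ijk} \le \log\frac{1}{1-\delta}.
\end{align*}
```

Both inequalities bound the same left-hand side, so they hold together exactly when the sum is at most the smaller of the two right-hand sides.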

In the above differential privacy constraints (each user log generates
one constraint), since $\forall t_{ijk} = \frac{c_{ij}}{c_{ij} - c_{ijk}} > 1$, the coefficients
$\forall \log t_{ijk}$ of all the linear constraints are greater than 0 (all
unique query-url pairs have been removed). Letting $Mx \le b$ be
the above differential privacy constraints, all the elements in the
constraint matrix $M$ are non-negative and all the elements in $b$ are
equal to $\min\{\epsilon, \log \frac{1}{1 - \delta}\}$. Thus, we have:

STATEMENT 1. Differential privacy constraints (Equation 4)
are always feasible and bounded.

We show the above property from the geometric perspective of
linear constraints. Specifically, linear constraints $\{Mx \le b, x \ge 0, b \ge 0\}$
form a convex polytope, which is always feasible and
bounded if $M, b \ge 0$ [26]. For example, in Figure 2(a) (two differential
privacy constraints are generated by two user logs which include
three distinct query-url pairs), the feasible region of $\{Mx \le b, x \ge 0, b \ge 0\}$
is formed as polytope OABCDE by two constraints
(the space below planes AFH and GCD). Similarly, in Figure 2(b)
(three differential privacy constraints are generated by three user
logs which include two distinct query-url pairs), all the solutions
in the feasible region OABC (the region below AD, FC and EH)
satisfy all the differential privacy constraints. For more variables
and constraints, more hyperplanes would form the polytope, which is
still feasible and bounded [25].

² For all unique query-url pairs, we let the output count be 0 (for
satisfying Condition 1 in Theorem 1).





Figure 2: Differential Privacy Constraints. (a) 3 query-url pairs in 2 user logs; (b) 2 query-url pairs in 3 user logs.

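One interesting point worth noting is that the size of the output
(the total number of all users' query-url pairs in the output) is
bounded by the differential privacy constraints. If we regard the
output size $\sum_{\forall (q_i, u_j) \in D} x_{ij}$ as the utility objective function, we
can use the following problem to seek the optimal output utility:

$$\begin{aligned} \max: \ & \sum_{\forall (q_i, u_j) \in D} x_{ij} \\ \text{s.t.} \ & \begin{cases} \forall A_k \subseteq D, \ \sum_{\forall (q_i, u_j) \in A_k} x_{ij} \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\} \\ \forall x_{ij} \ge 0 \text{ and } x_{ij} \text{ is an integer} \end{cases} \end{aligned}$$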

We define the above problem as the "Output size Utility-Maximizing
Problem" (O-UMP). Since it is an integer linear programming (ILP)
problem, we can solve it using a standard method (such as the
simplex algorithm) with linear relaxation [26] (the LP problem is
always feasible and bounded). After solving it (optimal solution
$x^* = \{\forall \lfloor x^*_{ij} \rfloor\}$), for every $(q_i, u_j)$, we sample user-IDs with $\lfloor x^*_{ij} \rfloor$
multinomial trials (the input query-url-user histogram provides
the probability of every sampled outcome in one trial). The
sanitization algorithm satisfies Definition 2 (proof in Appendix D).

LEMMA 1. The O-UMP based sanitization algorithm satisfies
$(\epsilon, \delta)$-probabilistic differential privacy for any pair of neighboring
input search logs.
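The sampling step that follows the optimization can be sketched as below: the relaxed counts are floored, and each query-url pair receives that many multinomial trials weighted by the input histogram. This is a minimal illustration, not the authors' implementation; the dictionary layout `c[pair][user]`, the function name `sample_output`, and the use of `random.Random.choices` to perform the independent trials are our own choices.

```python
import math
import random


def sample_output(c, x_opt, rng=None):
    """Sample user-IDs for every query-url pair.

    c[pair][user] = c_ijk (input query-url-user histogram)
    x_opt[pair]   = optimal count, possibly fractional after linear relaxation
    Returns output[pair][user] = number of times the user-ID was drawn.
    """
    rng = rng or random.Random()
    output = {}
    for pair, users in c.items():
        trials = math.floor(x_opt.get(pair, 0))   # floor of x*_ij
        if trials <= 0:
            continue                               # pair is not output at all
        ids = list(users)
        weights = [users[u] for u in ids]          # trial probability is c_ijk / c_ij
        histogram = {}
        for user in rng.choices(ids, weights=weights, k=trials):
            histogram[user] = histogram.get(user, 0) + 1
        output[pair] = histogram
    return output


# Hypothetical input histogram and relaxed solution.
hist = {("q1", "u1"): {"s1": 3, "s2": 7}, ("q2", "u2"): {"s1": 1, "s3": 1}}
print(sample_output(hist, {("q1", "u1"): 2.9, ("q2", "u2"): 0.4}, random.Random(1)))
```

Because the output is again a set of (query, url, user-ID) counts, it has the same schema as the input.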

Since the optimal solution $x^* = \{\forall x^*_{ij}\}$ satisfies the differential
privacy constraints, the randomization algorithm based on the
linear relaxed solution should also be differentially private ($\forall \lfloor x^*_{ij} \rfloor \le x^*_{ij}$,
thus $\forall \lfloor x^*_{ij} \rfloor$ strictly satisfies the constraints $Mx \le b$ where
$M, b \ge 0$). Note that if we require adding Laplacian noise to
$\{\forall x^*_{ij}\}$ to ensure differential privacy for the step of computing
optimal counts, we cannot always guarantee that the noise-added
optimal solution satisfies the differential privacy constraints, though
this is likely (since the mean of the added Laplacian noise is 0).
Meanwhile, since the amount of noise $Lap(d/\epsilon')$ is directly
proportional to $d$ (privacy parameter $\epsilon'$ is fixed), $d$ can be lowered to
the preferred value (reducing the sensitivity/amount of noise) to
gain a closer approximation of strict end-to-end differential privacy.
These observations also apply to the following utility-maximizing problems.

5.2 Optimal Utility of Frequent query-url Pairs 

Top frequent click-through pairs in search logs have better utility
[12] than abnormal query-url pairs for improving the quality of
search results or enhancing the search with recommendations and
suggestions. Retaining frequent query-url pairs in the sanitized
search logs can be a basic and practical goal of seeking the optimal
output utility in the sanitization. We denote this problem as the
"Frequent query-url pair Utility-Maximizing Problem" (F-UMP).

First of all, we denote $|D|$ as the size (the total number of query-url
pairs) of the input search log $D$. Thus, frequent query-url pairs
can be identified using their support in $D$: given a minimum support
threshold $s$, if $\frac{c_{ij}}{|D|} \ge s$, then $(q_i, u_j)$ is a frequent click-through
query-url pair in $D$. Since the support of a frequent query-url pair
explicitly indicates its importance in the search log, the support
of all the frequent query-url pairs should be preserved as much as
possible. In other words, the support of every frequent query-url
pair in the output $O$ should be close to its support in the input $D$
($|D|$ does not include the number of unique query-url pairs, which
should be removed in the preprocessing step).

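Thus, we can define the objective function as minimizing the
sum of support distances for all the "frequent query-url pairs" in
the input search log $D$:

$$\min: \sum_{\forall (q_i, u_j) \in D \text{ where } \frac{c_{ij}}{|D|} \ge s} \left| \frac{x_{ij}}{|O|} - \frac{c_{ij}}{|D|} \right| \tag{5}$$

where $|O| = \sum_{\forall (q_i, u_j) \in D} x_{ij}$ is the size of the output $O$.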



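With this objective, we formulate the F-UMP using the differential
privacy constraints as below:

$$\begin{aligned} \min: \ & \sum_{\forall (q_i, u_j) \in D \text{ where } \frac{c_{ij}}{|D|} \ge s} \left| \frac{x_{ij}}{|O|} - \frac{c_{ij}}{|D|} \right| \\ \text{s.t.} \ & \begin{cases} \forall A_k \subseteq D, \ \sum_{\forall (q_i, u_j) \in A_k} x_{ij} \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\} \\ \forall x_{ij} \ge 0 \text{ and } x_{ij} \text{ is an integer} \end{cases} \end{aligned}$$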



Generally, since every query-url pair's supports in $D$ and $O$ are
two ratios, pursuing the minimized sum of support distances (our
objective in F-UMP) cannot always guarantee an output with good
frequent query-url pair utility (e.g., the total count of all frequent
query-url pairs may be very small even though their supports are close to the
original ones). Alternatively, we can specify a fixed output size $|O|$ in
the sanitization and seek the optimal utility for the frequent query-url
pairs. Recall that O-UMP can generate the output with the maximum
size for any input $D$ and fixed parameters $(\epsilon, \delta)$ (we denote
the maximum output size as $\lambda$). Thus, to preserve sufficient output
size, we can solve the F-UMP with a specified constant output size
$|O| \in (0, \lambda]$.

STATEMENT 2. F-UMP can be considered as an integer linear
programming (ILP) problem if we fix the output size $|O|$ as a constant
and standardize the absolute values in the objective function.

First, since $|O| = \sum_{\forall (q_i, u_j) \in D} x_{ij}$, if we specify the size of
the output in the sanitization, $\frac{x_{ij}}{|O|}$ can be considered as linear.
Second, we can transform the absolute values in the objective
function in a standard way:

1. create a new variable $y_{ij}$ for every frequent query-url pair
$\forall (q_i, u_j)$ where $\frac{c_{ij}}{|D|} \ge s$: $y_{ij} = \left| \frac{x_{ij}}{|O|} - \frac{c_{ij}}{|D|} \right|$;

2. generate two new constraints for every $y_{ij}$: $y_{ij} \ge \frac{x_{ij}}{|O|} - \frac{c_{ij}}{|D|}$
and $y_{ij} \ge \frac{c_{ij}}{|D|} - \frac{x_{ij}}{|O|}$.

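As a result, F-UMP can be transformed into an integer linear
programming (ILP) problem as below:

$$\begin{aligned} \min: \ & \sum_{\forall (q_i, u_j) \in D \text{ where } \frac{c_{ij}}{|D|} \ge s} y_{ij} \\ \text{s.t.} \ & \begin{cases} \forall A_k \subseteq D, \ \sum_{\forall (q_i, u_j) \in A_k} x_{ij} \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\} \\ \forall (q_i, u_j) \text{ where } \frac{c_{ij}}{|D|} \ge s, \ y_{ij} \ge \frac{x_{ij}}{|O|} - \frac{c_{ij}}{|D|} \\ \forall (q_i, u_j) \text{ where } \frac{c_{ij}}{|D|} \ge s, \ y_{ij} \ge \frac{c_{ij}}{|D|} - \frac{x_{ij}}{|O|} \\ \forall x_{ij} \ge 0 \text{ and } x_{ij} \text{ is an integer} \end{cases} \end{aligned}$$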

Similar to O-UMP, we can solve the above ILP problem using
a standard method such as the simplex algorithm with linear
relaxation [26] (if $|O|$ is specified to be no greater than $\lambda$, the ILP
problem should be feasible and bounded).
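To make the F-UMP objective concrete, the sketch below evaluates the sum of support distances for a given vector of output counts. It only evaluates the objective and does not solve the ILP. The function name, the flat dictionaries `c[pair] = c_ij` and `x[pair] = x_ij`, and the toy counts are hypothetical, and the input counts are assumed to be already preprocessed (unique pairs removed).

```python
def support_distance_sum(c, x, s):
    """Sum over frequent pairs (c_ij/|D| >= s) of |x_ij/|O| - c_ij/|D||."""
    size_d = sum(c.values())
    size_o = sum(x.values())
    if size_d == 0 or size_o == 0:
        raise ValueError("input and output must both be non-empty")
    total = 0.0
    for pair, c_ij in c.items():
        if c_ij / size_d >= s:
            diff = x.get(pair, 0) / size_o - c_ij / size_d
            # The auxiliary variable y_ij of the ILP equals max(diff, -diff) at the optimum.
            total += max(diff, -diff)
    return total


# Hypothetical counts: pairs "a" and "b" are frequent for s = 0.25.
c_toy = {"a": 50, "b": 30, "c": 20}
print(support_distance_sum(c_toy, {"a": 6, "b": 2, "c": 2}, s=0.25))
```

An output that scales every count by the same factor has distance 0, which is why the output size is fixed separately rather than left to this objective.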

Overall, in F-UMP based sanitization, we can specify an appropriate
output size $|O| \in (0, \lambda]$, solve the ILP problem (optimal
solution $x^* = \{\forall \lfloor x^*_{ij} \rfloor\}$) and generate the optimal output utility:
the input/output supports of all the frequent query-url pairs tend
to be close (only counting the non-unique query-url pairs) and the
output size can be assured as well. Finally, we sample the output
with the optimal solution of F-UMP: for every $(q_i, u_j)$ (either frequent
or infrequent), we sample user-IDs with $\lfloor x^*_{ij} \rfloor$ multinomial
trials (again, the input query-url-user histogram provides the
probability of every sampled outcome in one trial). As discussed in
Section 3.2, the shape of the query-url-user histogram can be preserved
in the sanitization algorithm based on this problem. Also, the sanitization
algorithm satisfies Definition 2 (proof in Appendix D).

LEMMA 2. The F-UMP based sanitization algorithm satisfies
$(\epsilon, \delta)$-probabilistic differential privacy for any pair of neighboring
input search logs.

5.3 Maximizing query-url Pair Diversity 

Since more distinct query-url pairs occasionally exhibit better utility,
we can formulate the "Diversity Utility-Maximizing Problem"
(D-UMP) in search log sanitization. The diversity of search logs
normally has two facets: the diversity of search queries and the diversity
of query-url pairs. Since we investigate the potential privacy breach
from every query-url pair (finer-grained than search queries), we
define the diversity utility of search logs as the number of distinct
query-url pairs. (Indeed, we can also model the search query diversity
maximizing problem in a similar way.)

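In our sanitization, $x_{ij}$ represents the count of query-url pair
$(q_i, u_j)$ in the output $O$. To evaluate the diversity of the sanitized
search log $O$, we can introduce another variable $y_{ij}$ for every $x_{ij}$:

$$y_{ij} = \begin{cases} 1, & \text{if } x_{ij} > 0 \\ 0, & \text{if } x_{ij} = 0 \end{cases} \tag{6}$$

We thus define the utility function as $\max: \sum y_{ij}$. Moreover,
given a large constant $H \ge \max\{\forall c_{ij}\}$, Equation 6 is guaranteed
to hold by the following inequalities:

$$\begin{aligned} & \forall (q_i, u_j), \ x_{ij} \le y_{ij} \cdot H \\ & \forall (q_i, u_j), \ x_{ij} \ge y_{ij} \\ & y_{ij} \in \{0, 1\}, \ \forall x_{ij} \ge 0, \ H \ge \max\{\forall c_{ij}\} \end{aligned} \tag{7}$$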
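A brute-force check illustrates why the two inequalities pin the indicator down: for every integer count between 0 and $H$, exactly one value of $y_{ij}$ is feasible, and it is 1 precisely when $x_{ij} > 0$. The helper name and the value $H = 5$ below are hypothetical choices for the illustration.

```python
def feasible_indicators(x, H):
    """All y in {0, 1} satisfying x <= y * H and x >= y."""
    return [y for y in (0, 1) if x <= y * H and x >= y]


H = 5  # hypothetical constant with H >= max c_ij
table = {x: feasible_indicators(x, H) for x in range(H + 1)}
for x, ys in table.items():
    print(x, ys)
```

For $x_{ij} = 0$ the second inequality rules out $y_{ij} = 1$, and for $x_{ij} > 0$ the first inequality rules out $y_{ij} = 0$.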

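As a result, D-UMP can be formally defined as:

$$\begin{aligned} \max: \ & \sum_{\forall (q_i, u_j) \in D} y_{ij} \\ \text{s.t.} \ & \begin{cases} \forall A_k \subseteq D, \ \sum_{\forall (q_i, u_j) \in A_k} x_{ij} \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\} \\ \forall (q_i, u_j) \in D, \ x_{ij} \le y_{ij} \cdot H \\ \forall (q_i, u_j) \in D, \ x_{ij} \ge y_{ij} \\ H \ge \max\{\forall c_{ij}\}, \ x_{ij} \ge 0 \text{ and } x_{ij} \text{ is an integer}, \ y_{ij} \in \{0, 1\} \end{cases} \end{aligned}$$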



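Essentially, letting $\forall x_{ij} \in \{0, 1\}$ and $x_{ij} = y_{ij}$, the above
mixed integer programming (MIP) problem can be transformed to
a simplified binary integer programming (BIP) problem (see Equation 8).
Both problems have the same optimal solution for variables
$y = \{\forall y_{ij}\}$. (We prove Theorem 2 in Appendix C.)

THEOREM 2. The optimal solution $y^* = \{\forall y^*_{ij}\}$ of the BIP
problem is equivalent to the values $\{\forall y^*_{ij}\}$ in the optimal solution
$\{x^*, y^*\} = \{\forall x^*_{ij}, \forall y^*_{ij}\}$ of the MIP problem.

$$\begin{aligned} \max: \ & \sum_{\forall (q_i, u_j) \in D} y_{ij} \\ \text{s.t.} \ & \begin{cases} \forall A_k \subseteq D, \ \sum_{\forall (q_i, u_j) \in A_k} y_{ij} \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\} \\ H \ge \max\{c_{ij}\}, \ \forall y_{ij} \in \{0, 1\} \end{cases} \end{aligned} \tag{8}$$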

After solving the simpler BIP problem rather than the MIP problem
(both problems are feasible), we let $\forall (q_i, u_j) \in D, x_{ij} =
y_{ij} \in \{0, 1\}$ be the optimal solution of D-UMP (sampling user-IDs
in only one trial for every query-url pair in the output; as before,
the input query-url-user histogram provides the probability of every
sampled outcome in one trial).

However, both the BIP and MIP problems are NP-hard [26]. For
large-scale D-UMP, we propose an effective and efficient heuristic
algorithm to solve the BIP problem in Algorithm 2. It seeks an
approximate optimal value for the BIP problem. We iteratively remove
sensitive query-url pairs (let $y_{ij} = 0$ if $y_{ij}$ has the maximum
positive coefficient $t_{ijk}$ in the sparse constraint matrix). We eliminate
these query-url pairs since a single user holds the highest percentage
of them in the count histogram of the query-url-user triplets
(they are sensitive to the corresponding user; e.g., if user $s_k$ holds 90%
of $(q_i, u_j)$, $t_{ijk}$ is large). The algorithm terminates once all
the differential privacy constraints are satisfied.



Algorithm 2 Sensitive query-url Pair Eliminating (SPE) Heuristic

Input: search log $D$ and differential privacy parameters $(\epsilon, \delta)$
Output: approximate optimal solution for D-UMP, $y^* = \{\forall y^*_{ij}\}$

remove all the unique query-url pairs from $D$ (preprocessing)
for every $(q_i, u_j) \in D$ do
    $y_{ij} \leftarrow 1$
while true do
    find the maximum $t_{ijk} = \frac{c_{ij}}{c_{ij} - c_{ijk}}$ in the constraint matrix (among pairs with $y_{ij} = 1$)
    let $y_{ij} \leftarrow 0$ for the maximum $t_{ijk}$
    if $\forall A_k$, $\sum_{\forall (q_i, u_j) \in A_k} y_{ij} \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\}$ then
        break
return $y^* = \{\forall y^*_{ij}\}$
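The heuristic can be sketched in a few lines of Python. This is an illustrative reading of Algorithm 2 rather than the authors' code: the data layout `c[pair][user] = c_ijk` and the function name are ours, the sketch tests the constraints before each removal (so nothing is removed if the initial all-ones assignment is already feasible), ties between equal coefficients are broken arbitrarily, and the loads are recomputed from scratch in each round for clarity rather than speed.

```python
from math import log


def spe_heuristic(c, eps, delta):
    """Greedy SPE sketch. c[pair][user] = c_ijk (> 0); returns y[pair] in {0, 1}."""
    bound = min(eps, log(1.0 / (1.0 - delta)))
    y, coef = {}, {}
    for pair, users in c.items():
        if len(users) < 2:            # unique pair: removed in preprocessing
            y[pair] = 0
            continue
        y[pair] = 1
        total = sum(users.values())   # c_ij
        for user, cnt in users.items():
            coef[(pair, user)] = log(total / (total - cnt))   # log t_ijk

    def violated():
        load = {}
        for (pair, user), w in coef.items():
            if y[pair]:
                load[user] = load.get(user, 0.0) + w
        return any(v > bound for v in load.values())

    while violated():
        # Drop the retained pair that carries the largest coefficient.
        pair, _user = max((key for key in coef if y[key[0]]), key=coef.get)
        y[pair] = 0
    return y


# Hypothetical toy log: user s1 dominates pair A, pair C is unique to s3.
toy_log = {
    "A": {"s1": 9, "s2": 1},
    "B": {"s1": 1, "s2": 1, "s3": 2},
    "C": {"s3": 4},
}
print(spe_heuristic(toy_log, eps=1.0, delta=0.6))
```

Maximizing $t_{ijk}$ and maximizing $\log t_{ijk}$ select the same pair, so the sketch stores the logarithms that actually appear in the constraints.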



The sanitization algorithm based on D-UMP also satisfies Definition 2
(proof in Appendix D).

LEMMA 3. The D-UMP based sanitization algorithm satisfies
$(\epsilon, \delta)$-probabilistic differential privacy for any pair of neighboring
input search logs.

6. EXPERIMENTAL RESULTS 
6.1 Experiment Setup³

³ Since the published search logs in [19] and [10] do not include
pseudonymous user-IDs for associating distinct query-url pairs in
every user's search history, the utility of our sanitized search logs is
not directly comparable with their work. Moreover, since Laplacian noise has
been well evaluated in their work, we focus on testing the optimal
utility w.r.t. the output counts of all query-url pairs.



Dataset. In our experiments, we utilize the real AOL search log
[3, 11] to test our utility-maximizing problems. Our experimental
dataset is extracted from one subset of the AOL data. Specifically,
we randomly pick 2500 out of over 65000 user logs in the selected
AOL data. We remove all the unique query-url pairs (those that appear in only
one user log) from the selected dataset in our preprocessing step.
Table 3 presents the characteristics of the AOL dataset (only
the tuples with clicks are collected), our randomly selected dataset and
the preprocessed dataset. 6043 distinct query-url pairs are held by
1980 users in the preprocessed dataset (since search logs, which are
extremely diverse, include a large number of unique query-url pairs,
most of the existing work [19, 10] cannot maintain the entire output
diversity either). Thus, we have 6043 variables and 1980 differential
privacy constraints in our UMPs.

Table 3: Characteristics of the Data Sets

                           AOL Dataset   Exp. Dataset   Preprocessed Dataset
                                                        (without unique pairs)
# of total tuples (size)   1,864,860     237,786        53,067 (|D|)
# of user logs             51,922        2,500          1,980 (Constraints)
# of distinct queries      583,084       83,130         4,971
# of distinct urls         373,837       82,076         4,289
# of query-url pairs       1,190,491     163,681        6,043 (Variables)



Experimental Parameters Setup. To observe the effect of tuning the
differential privacy parameters $(\epsilon, \delta)$, we let $\delta \in \{10^{-4}, 10^{-3},
10^{-2}, 10^{-1}, 0.2, 0.5, 0.8\}$ and $e^{\epsilon} \in \{1.001, 1.01, 1.1, 1.4, 1.7, 2.0,
2.3\}$ in all three utility-maximizing problems. Furthermore, F-UMP
requires two additional parameters: the minimum support $s$
and the output size $|O|$ ($|O| \le \lambda$, where $\lambda$ is given as the optimal value
of O-UMP). We vary $s$ over several minimum support values (including $s = 1/500$). For every pair
of $\epsilon$ and $\delta$, we compute $\lambda$ in O-UMP and specify an appropriate
output size $|O|$ in F-UMP.

Experimental Platform. All the experiments are performed on
an HP machine with an Intel Core 2 Duo 3 GHz CPU and 3 GB RAM
running the Microsoft Windows XP Professional operating system.
While solving D-UMP, we also submit the AMPL format of the
BIP problems to three NEOS solvers (qsopt_ex, scip and feaspump
[16]) running online, in addition to locally running our heuristic.

6.2 Maximum Output Size $\lambda$

With the preprocessed dataset ($|D| = 53067$ as shown in Table
3), we can compute the maximum output size $\lambda$ using O-UMP
for a given pair of differential privacy parameters $(e^{\epsilon}, \delta)$. Table
4 presents the maximum output size (the optimal value of O-UMP)
for different pairs of $(e^{\epsilon}, \delta)$, where O-UMP is solved by the Matlab
function linprog. To generate the output $O$, we can sample user-IDs
for every query-url pair according to the optimal solution (6043
variables/query-url pairs). $|O|$ can be maximized while the entire
process satisfies $(\epsilon, \delta)$-differential privacy. We can obtain 7.08%-26.2%
of the original size with the given parameters. Due to the
high diversity and sparseness of search log data, this percentage of
output size is sufficiently good for sanitization algorithms that
guarantee differential privacy.

Table 4: Maximum Output Size $\lambda$ on $e^{\epsilon}$ and $\delta$ ($|D| = 53067$)

e^ε \ δ   10^-4   10^-3   10^-2   10^-1   0.2     0.5     0.8
1.001     3759    4007    4007    4007    4007    4007    4007
1.01      3759    4007    4879    4879    4879    4879    4879
1.1       3759    4007    4891    8382    8382    8382    8382
1.4       3759    4007    4891    8874    10445   11419   11419
1.7       3759    4007    4891    8874    10445   12438   12438
2.0       3759    4007    4891    8874    10445   13088   13088
2.3       3759    4007    4891    8874    10445   13088   13901



6.3 Optimal Utility of Frequent query-url Pairs 

Recall that F-UMP based sanitization generates outputs with the
minimum sum of the support distances of all the frequent query-url
pairs. Thus, we examine the maximum frequent query-url pair
utility with three measures: the optimal value of F-UMP (minimum
sum of the support distances, see Equation 5), and the Precision and
Recall of the frequent query-url pairs in the input/output. Precision
and Recall are defined as below:

$$Precision = \frac{|S_D \cap S_O|}{|S_O|}, \quad Recall = \frac{|S_D \cap S_O|}{|S_D|} \tag{9}$$

where $S_D$ and $S_O$ denote the sets of frequent query-url pairs in $D$
and $O$ respectively, and $|\cdot|$ means the cardinality of the set. Specifically,
Precision is defined to evaluate the fraction of the frequent
query-url pairs in the output that are originally frequent in the input
with the same minimum support. Recall is defined to evaluate
the fraction of the frequent query-url pairs in the input that remain
frequent in the output with the same minimum support.
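The two measures are straightforward to compute from the input and output counts. The sketch below is a hedged illustration: the function names and toy counts are hypothetical, the counts are assumed to be preprocessed (unique pairs removed), and returning 1.0 when a denominator set is empty is our own convention, not something the paper specifies.

```python
def frequent_pairs(counts, s):
    """Pairs whose support (count / total count) is at least s."""
    total = sum(counts.values())
    if total == 0:
        return set()
    return {pair for pair, cnt in counts.items() if cnt / total >= s}


def precision_recall(c, x, s):
    """Precision and Recall of frequent pairs for input counts c and output counts x."""
    s_d = frequent_pairs(c, s)    # frequent in the input D
    s_o = frequent_pairs(x, s)    # frequent in the output O
    common = s_d & s_o
    precision = len(common) / len(s_o) if s_o else 1.0   # convention for an empty set
    recall = len(common) / len(s_d) if s_d else 1.0      # convention for an empty set
    return precision, recall


# Hypothetical counts: with s = 0.18, {a, b} are frequent in D and {a, c} in O.
c_in = {"a": 50, "b": 30, "c": 15, "d": 5}
x_out = {"a": 6, "b": 1, "c": 2, "d": 1}
print(precision_recall(c_in, x_out, s=0.18))
```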

To evaluate the performance of F-UMP in differentially private
search log sanitization, we run two groups of experiments. First, we
fix the output size ($|O| = 3000 \le \lambda$) and the minimum support $s$,
and test the (measurement) results with different
pairs of $(\epsilon, \delta)$. Second, we fix the differential privacy parameters
as $e^{\epsilon} = 2$, $\delta = 0.5$ ($\lambda = 13088$, as shown in Table 4), and
test the results with different minimum supports $s$ and output sizes
$|O|$. One essential point worth noting is that the minimum sum of
support distances is an effective measure in the first group of experiments
because the minimum support $s$ is fixed and the original
frequent query-url pairs in the input have been determined for all
different pairs of $\epsilon$ and $\delta$ (thus the sum of the support distances for all
the frequent query-url pairs in the input is comparable). However,
in the second group, the set of original frequent query-url pairs
varies with $s$, hence the objective values of F-UMP are not
comparable across different $s$. Therefore, we use the average of the
support distances for all the frequent query-url pairs in the input in
addition to their sum in the second group of experiments.

Interestingly, in all our F-UMP experiments, Precision is always
equal to 1, which means all the frequent query-url pairs in the output
are also frequent in the input with the same minimum support
$s$. This is quite reasonable: suppose that $(q_i, u_j)$ is not a frequent
query-url pair in the input, i.e., $\frac{c_{ij}}{|D|} < s$; if it is frequent in the
output, i.e., $\frac{x_{ij}}{|O|} \ge s$, the solution of F-UMP must not be optimal
(reducing $x_{ij}$ to 0 might improve the objective value and does
not violate the differential privacy constraints).

In the first group of experiments, Figures 3(a) and 3(b) demonstrate
the Recall and Sum of the Support Distances for all the frequent
query-url pairs in the input. Fixing $\delta$, Recall increases as $\epsilon$
increases until $\epsilon = \log \frac{1}{1 - \delta}$. Fixing $\epsilon > \log \frac{1}{1 - \delta}$, Recall increases
as $\delta$ increases; fixing $\epsilon < \log \frac{1}{1 - \delta}$, Recall stays invariant even if
$\delta$ is increasing. By contrast, the sum of support distances shows the
opposite trend as $\epsilon$ and $\delta$ vary.

Table 5: Recall on Output Size $|O|$ and Minimum Support $s$
($e^{\epsilon} = 2$, $\delta = 0.5$, $\lambda = 13088$)

Each row corresponds to one minimum support value $s$ (row labels not recoverable).

s \ |O|   3000     4000     5000     6000     7000     8000
row 1     0.8873   0.8189   0.874    0.8661   0.8583   0.8346
row 2     0.8095   0.8762   0.8571   0.8476   0.8952   0.8667
row 3     0.9143   0.9143   0.9286   0.9143   0.8857   0.8714
row 4     0.9116   0.8529   0.8529   0.8529   0.8529   0.8235
row 5     0.933    0.8667   0.8      0.8      0.8      0.7333

In the second group of experiments, Table 5 presents the Recall
on different pairs of output size and minimum support. As we
can see, over 80% of the frequent query-url pairs can be retained
in the output when fixing $e^{\epsilon} = 2$ and $\delta = 0.5$ (given stricter
$e^{\epsilon}$ and $\delta$, 30% of them can be retained, as shown in Figure 3(a)).
In addition, Table 6 illustrates the sum of support distances for all
frequent query-url pairs in the input (with the same $|O|$ and $s$ as Table
5). Fixing $s$, the sum of support distances increases as the output
size increases (the values are comparable due to the fixed $s$). This is
expected: given a fixed minimum support $s$, for the fixed set of frequent
query-url pairs in the input, it is easier to achieve the minimum
support without violating the differential privacy constraints when $|O|$
is not too large (the ideal output count $x_{ij}$ is $|O| \cdot \frac{c_{ij}}{|D|}$ and the
output counts are bounded by privacy constraints, thus all frequent
query-url pairs $\forall x_{ij}$ are likely to achieve $|O| \cdot \frac{c_{ij}}{|D|}$ if $|O|$ is small).
Finally, since the set of frequent query-url pairs varies for different
$s$, we compare the average support distance instead of the sum
for different $s$. As shown in Figure 3(c), the average support
distance decreases as the minimum support $s$ increases (the minimum
support $s$ is on a logarithmic scale). Therefore, the supports of the frequent query-url pairs
in the output are closer to those in the input if a larger minimum
support is given in the F-UMP.

Figure 3: F-UMP Performance. (a) Recall on $(\epsilon, \delta)$; (b) Sum of Support Distances on $(\epsilon, \delta)$; (c) Average Support Distance on $(s, |O|)$.

Table 6: Sum of Freq. query-url Pair Support Distances on Output
Size $|O|$ and Min. Support $s$ ($e^{\epsilon} = 2$, $\delta = 0.5$, $\lambda = 13088$)

Each row corresponds to one minimum support value $s$ (row labels not recoverable).

s \ |O|   3000     4000     5000     6000     7000     8000
row 1     0.0551   0.085    0.1058   0.1279   0.1485   0.1785
row 2     0.0549   0.0854   0.1116   0.1271   0.1477   0.1767
row 3     0.0559   0.0865   0.1048   0.1247   0.1448   0.1716
row 4     0.0555   0.086    0.1043   0.1236   0.1393   0.161
row 5     0.0574   0.088    0.1063   0.1246   0.1392   0.1583



6.4 Maximum query-url Pair Diversity 

6.4.1 D-UMP Performance

We now look at the performance of D-UMP (maximum diversity
utility). Figure 4 shows the percentage of retained query-url pairs
in the output with the same parameters $(\epsilon, \delta)$ as F-UMP. The maximum
query-url diversity has a similar increasing trend to the Recall
of F-UMP (Figure 3(a)). Moreover, the query-url diversity can be
retained as high as 30%. Note that the input has been preprocessed by
removing all the unique query-url pairs, and they are not counted
in the denominator of the ratio.

6.4.2 BIP Solver Comparison 

Since D-UMP is an NP-hard problem, we introduced an effective
heuristic algorithm (Algorithm 2) for this binary integer programming
(BIP) problem with a sparse non-negative constraint matrix.
We now compare the performance of our Sensitive Pair Eliminating
heuristic (SPE) with some popular BIP solvers (Matlab bintprog
function, NEOS qsopt_ex, NEOS scip and NEOS feaspump [16]).




Figure 4: Maximum Diversity on $(\epsilon, \delta)$ (Algorithm 2)

Table 7: Retained Diversity Utility of Different BIP Solvers

(a) $e^{\epsilon} = 2$

30.3%
SPE (Heuristic)    17.7%   25.7%   26.0%   26.0%   26.0%   26.0%
Matlab bintprog
Neos feaspump
25.8%   25.8%   25.8%   25.8%


As shown in Table 7, we collected the maximum percentage of
retained distinct query-url pairs using all the solvers with the same
experimental inputs. We observe that our heuristic algorithm performs
better than the other solvers in most cases, and the optimal values
found by all the solvers vary in a similar way. Specifically,
Algorithm 2 generates sanitized search logs with greater query-url
pair diversity than Matlab bintprog, NEOS qsopt_ex and NEOS scip.
NEOS feaspump performs slightly better than Algorithm 2 only
when ($e^{\epsilon} = 2$, $\delta = 0.5$) and ($e^{\epsilon} = 1.1$, $\delta = 0.1$).

Finally, we plot the computational costs for solving a typical D-UMP
by all solvers in Figure 5 ($e^{\epsilon} = 1.7$, $\delta = 10^{-3}$). Since
our Sensitive query-url Pair Eliminating (SPE) algorithm has
complexity $O(n^2 \log mn)$ (constraint matrix size: $m \times n$), it
outperforms the other solvers for our D-UMP in running time as well.

6.5 Difference of Input/Output Histograms 

As described in Section 3.2, our multinomial sampling, particularly
the F-UMP based sanitization, can retain the shape of the
histograms in the output (i.e., generate similar count histograms for
distinct query-url-user triplets $(q_i, u_j, s_k)$). We now examine this by
comparing the two histograms.

Figure 5: Computational Performance for Solving D-UMP
($e^\epsilon = 1.7$, $\delta = 10^{-3}$, Logarithmic scale runtime)

(a) $|O| = 4000$ (b) $|O| = 6000$

Figure 6: The Difference Ratio of Input and Output query-url-user (Triplets) Histogram (F-UMP based Sanitization)

Specifically, we generate 10 randomized outputs according to
the optimal solution of F-UMP for two different output sizes, $|O| = 4000$
and $|O| = 6000$ (fixing $e^\epsilon = 2$, $\delta = 0.5$, $s = 1/500$),
and plot two bar plots in Figure 6: the X-axis varies from 0% to
100%, while the Y-axis represents the average number of distinct
triplets $(q_i, u_j, s_k)$⁴ whose difference ratio of the input/output
histograms (defined in Equation (10)) equals the value on the X-axis. In
both Figure 6(a) and Figure 6(b), the percentage of most triplets $(q_i, u_j, s_k)$ in
the input/output varies within a tolerable bound (for $|O| = 4000$, the
difference ratio of about 75% of the triplets is below 40%; for $|O| = 6000$,
the difference ratio of about 90% of the triplets is below 40%).

$$\mathrm{DiffRatio}(x_{ijk}, c_{ijk}) = \frac{\left|\, x_{ijk}/|O| - c_{ijk}/|D| \,\right|}{c_{ijk}/|D|} \qquad (10)$$
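As a hedged illustration of how this ratio could be computed, the sketch below takes the input and output logs as lists of (query, url, user) triplets. It assumes that the ratio is the absolute difference between the output frequency $x_{ijk}/|O|$ and the input frequency $c_{ijk}/|D|$, normalized by the input frequency, and that $|D|$ and $|O|$ are the total numbers of triplet occurrences; the function name `diff_ratios` is hypothetical.

```python
from collections import Counter

def diff_ratios(input_triplets, output_triplets):
    """Return {triplet: difference ratio} for every triplet in the input log."""
    c = Counter(input_triplets)    # c_ijk: input counts
    x = Counter(output_triplets)   # x_ijk: output counts
    size_d = sum(c.values())       # |D| (assumed: total number of input triplets)
    size_o = sum(x.values())       # |O| (assumed: total number of output triplets)
    ratios = {}
    for triplet, c_ijk in c.items():
        f_in = c_ijk / size_d
        f_out = x[triplet] / size_o if size_o else 0.0
        # Relative deviation of the output frequency from the input frequency.
        ratios[triplet] = abs(f_out - f_in) / f_in
    return ratios
```

A histogram like the one in Figure 6 would then bin these ratios and average the bin counts over the randomized outputs.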



7. CONCLUSION AND FUTURE WORK 

In this paper, we have addressed the important practical problem 
of retaining the maximum utility while the search log sanitization 
satisfies differential privacy and generates outputs with the identical 
schema as the original search log. As a necessary step, we have 
defined three different notions of utility that are useful for various 
applications. We have implemented our approach and validated it 
on several real data sets. 

We can extend our work in several directions. First, additional
notions of utility can be considered and corresponding optimization
models created. We also need to explore ways of combining different
utility notions to create a single joint objective. This would be
akin to a multi-objective optimization. Second, corresponding to
the utility-maximizing problem, one can similarly define the privacy
breach-minimizing problem, which asks for minimal privacy
loss while satisfying a certain utility. Third, since we have modeled
the utility-maximizing problems in the optimization framework, it
should be possible to leverage the significant work in the field of
operations research to solve these problems. We intend to explore
these in the future.

⁴ The triplets w.r.t. infrequent query-url pairs can be ignored in
general. If s is sufficiently small, the shape of the query-url-user
histogram w.r.t. all query-url pairs can be optimally retained.
APPENDIX

A. PROOF OF THEOREM 1 

PROOF. Assume that $D$ and $D'$ differ in an arbitrary user $s_k$'s
user log $A_k$. In Section 4 we discussed two sets of output spaces,
$\Omega = \Omega_1 \cup \Omega_2$: all the possible outputs in $\Omega_1$ include $s_k$, whereas
all the possible outputs in $\Omega_2$ do not include $s_k$. Hence, if the
probability inequalities in Definition 2 hold for the above $\Omega_1$, $\Omega_2$,
$(\epsilon, \delta)$-probabilistic differential privacy can be guaranteed for the
randomization algorithm with this output space split.

First, according to Equation 2, if $\forall A_k \subseteq D$,
$1 - \prod_{\forall (q_i, u_j) \in A_k} \left( \frac{c_{ij} - c_{ijk}}{c_{ij}} \right)^{x_{ij}} \le \delta$
(Condition 3) holds, we have $\Pr[\mathcal{R}(D) \in \Omega_1] \le \delta$ for any input $D$.
Meanwhile, Condition 1 guarantees
that $\Pr[\mathcal{R}(D) \in \Omega_1]$ can be effectively bounded by $\delta$. Otherwise,
for a unique query-url pair $(q_i, u_j)$ with $x_{ij} > 0$, $\Pr[\mathcal{R}(D) \in \Omega_1]$
would be equal to 1 with such an output space split (no other space
split is available for any pair of neighboring input search logs).

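Second, for all $O \in \Omega_2$, we have $\Pr[\mathcal{R}(D') = O] > 0$ and
$\Pr[\mathcal{R}(D) = O] > 0$. If $D' \subset D$, Condition 2 ensures
$\frac{\Pr[\mathcal{R}(D) = O]}{\Pr[\mathcal{R}(D') = O]} \le 1 \le \frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]} \le e^\epsilon$.
On the contrary, if $D \subset D'$, Condition 2 derived from $D'$ also guarantees
$\frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]} \le 1 \le \frac{\Pr[\mathcal{R}(D) = O]}{\Pr[\mathcal{R}(D') = O]} \le e^\epsilon$.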

Thus, the randomization algorithm $\mathcal{R}$ satisfies $(\epsilon, \delta)$-probabilistic
differential privacy (by dividing the output space as above) if the three
conditions in the theorem hold. Note that the violation of any condition
would result in an unbounded multiplicative and/or additive probability
difference (given $\epsilon$ and $\delta$) for at least one input $D$ and/or one of
its neighboring inputs $D'$ (differential privacy would not be guaranteed);
hence the upper bounds $\epsilon$ and $\delta$ are tight. □
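The three conditions used in this proof can be checked mechanically for a candidate set of output counts. The sketch below is a minimal illustration, assuming $t_{ijk} = c_{ij} / (c_{ij} - c_{ijk})$ so that Conditions 2 and 3 combine into the single log-form bound $\sum x_{ij} \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\}$ for every user log, and assuming $0 \le \delta < 1$. The function name and the dictionary layout are hypothetical.

```python
import math

def satisfies_conditions(counts, x, eps, delta):
    """counts[pair][user] = c_ijk (> 0); x[pair] = output count x_ij (>= 0)."""
    # Condition 1: a pair held by a single user must not appear in the output.
    for pair, per_user in counts.items():
        if len(per_user) == 1 and x.get(pair, 0) > 0:
            return False
    bound = min(eps, math.log(1.0 / (1.0 - delta)))
    users = {u for per_user in counts.values() for u in per_user}
    for u in users:
        total = 0.0
        for pair, per_user in counts.items():
            x_ij = x.get(pair, 0)
            if u not in per_user or x_ij == 0:
                continue
            c_ij = sum(per_user.values())
            # c_ij - c_ijk > 0 here because the pair has at least two users.
            t_ijk = c_ij / (c_ij - per_user[u])
            total += x_ij * math.log(t_ijk)
        # Conditions 2 and 3 in log form for user log A_k.
        if total > bound + 1e-12:
            return False
    return True
```

For example, with ten users who each issued a pair once, $t_{ijk} = 10/9$, so an output count of 3 passes the bound for $e^\epsilon = 2$, $\delta = 0.5$ while an output count of 7 does not.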

B. PROOF OF PROPOSITION 1 

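PROOF. W.l.o.g., assume that two arbitrary neighboring search
logs $D$ and $D'$ differ in one user log, $D = D' + A_k$, and that $\hat{O} \subseteq \Omega$
is an arbitrary set of possible outputs. For any input $D$, we can
divide the output space $\Omega$ into two sets $\Omega_1$ and $\Omega_2$ such that (1)
$\Pr[\mathcal{R}(D) \in \Omega_1] \le \delta$, and for $D, D'$ (2) $\forall O \in \Omega_2$,
$1/e^\epsilon \le \frac{\Pr[\mathcal{R}(D') = O]}{\Pr[\mathcal{R}(D) = O]} \le e^\epsilon$.

Let $\hat{O}_1 = \hat{O} \cap \Omega_1$ and $\hat{O}_2 = \hat{O} \cap \Omega_2$, thus:

$$
\begin{aligned}
\Pr[\mathcal{R}(D) \in \hat{O}] &= \int_{\forall O \in \hat{O}_1} \Pr[\mathcal{R}(D) = O]\,dO + \int_{\forall O \in \hat{O}_2} \Pr[\mathcal{R}(D) = O]\,dO \\
&\le \int_{\forall O \in \Omega_1} \Pr[\mathcal{R}(D) = O]\,dO + e^\epsilon \int_{\forall O \in \hat{O}_2} \Pr[\mathcal{R}(D') = O]\,dO \\
&\le \delta + e^\epsilon \int_{\forall O \in \hat{O}_2} \Pr[\mathcal{R}(D') = O]\,dO \\
&\le \delta + e^\epsilon \Pr[\mathcal{R}(D') \in \hat{O}_2] \le \delta + e^\epsilon \Pr[\mathcal{R}(D') \in \hat{O}].
\end{aligned}
$$

Similarly, we can prove that $\Pr[\mathcal{R}(D') \in \hat{O}] \le \delta + e^\epsilon \Pr[\mathcal{R}(D) \in \hat{O}]$.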

This completes the proof. □ 
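The chain of inequalities can be sanity-checked by brute force on a small discrete output space. The sketch below is only an illustration under stated assumptions: the two output distributions are given explicitly as dictionaries, the split into $\Omega_1$ and $\Omega_2$ is supplied by the caller, and only the direction derived above is checked. All names are hypothetical.

```python
import math
from itertools import combinations

def check_split_bound(p, p_prime, omega1, eps, delta, tol=1e-12):
    """p, p_prime: {output: probability} under D and D'; omega1: set of outputs."""
    outputs = sorted(set(p) | set(p_prime))
    omega2 = [o for o in outputs if o not in omega1]
    # Premise (1): the mass of Omega_1 under D is at most delta.
    if sum(p.get(o, 0.0) for o in omega1) > delta + tol:
        raise ValueError("premise (1) violated")
    # Premise (2): on Omega_2 the probability ratio lies in [e^-eps, e^eps].
    for o in omega2:
        ratio = p_prime[o] / p[o]
        if not (math.exp(-eps) - tol <= ratio <= math.exp(eps) + tol):
            raise ValueError("premise (2) violated")
    # Conclusion: Pr[R(D) in S] <= delta + e^eps * Pr[R(D') in S] for every S.
    for r in range(len(outputs) + 1):
        for subset in combinations(outputs, r):
            lhs = sum(p.get(o, 0.0) for o in subset)
            rhs = delta + math.exp(eps) * sum(p_prime.get(o, 0.0) for o in subset)
            if lhs > rhs + tol:
                return False
    return True
```

The enumeration is exponential in the number of outputs, so this is useful only as a check on toy instances.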

C. PROOF OF THEOREM 2 

PROOF. To distinguish the two optimal solutions $y^*$ in the BIP and
the MIP problem, we denote $y^*$ for the BIP and the MIP problem
as $(y^*)_B = \{\forall (y^*_{ij})_B\}$ and $(y^*)_M = \{\forall (y^*_{ij})_M\}$.

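• Suppose that $\exists (y^*_{ij})_B = 0$, $(y^*_{ij})_M = 1$ and $\forall z \ne ij$, $(y^*_z)_B = (y^*_z)_M$
($(y^*)_B$ and $(y^*)_M$ differ in one variable). Due to
$(y^*_{ij})_M = 1$ and $x^*_{ij} \ge (y^*_{ij})_M$, all the constraints $\forall A_k \subseteq D$,
$\sum_{\forall (q_i, u_j) \in A_k} (y^*_{ij})_M \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\}$ must
be satisfied for $(y^*)_M$.

In addition, $(y^*_{ij})_M > (y^*_{ij})_B \Rightarrow \sum_{\forall (q_i, u_j) \in D} (y^*_{ij})_M > \sum_{\forall (q_i, u_j) \in D} (y^*_{ij})_B$.
As $\forall (y^*_{ij})_B$ satisfies the constraints
$\forall A_k \subseteq D$, $\sum_{\forall (q_i, u_j) \in A_k} (y^*_{ij})_B \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\}$
in the BIP problem, $\sum_{\forall (q_i, u_j) \in D} (y^*_{ij})_M$ should be the optimal
value for the BIP problem if the other constraints are the
same for the two problems (due to $\sum_{\forall (q_i, u_j) \in D} (y^*_{ij})_M > \sum_{\forall (q_i, u_j) \in D} (y^*_{ij})_B$).
Hence, it is a contradiction.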

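• Suppose that $\exists (y^*_{ij})_B = 1$, $(y^*_{ij})_M = 0$ and $\forall z \ne ij$, $(y^*_z)_B = (y^*_z)_M$
($(y^*)_B$ and $(y^*)_M$ differ in one variable). Hence,
the constraints $\forall A_k \subseteq D$, $\sum_{\forall (q_i, u_j) \in A_k} (y^*_{ij})_B \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\}$
are satisfied in the BIP problem. In the MIP
problem, if we let $x_{ij}$ be 1 for all $(y^*_{ij})_B = 1$, then $\forall A_k \subseteq D$,
$\sum_{\forall (q_i, u_j) \in A_k} x_{ij} \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\}$ can be equally
satisfied. In this case, we have
$\sum_{\forall (q_i, u_j) \in D} (y^*_{ij})_B = \sum_{\forall (q_i, u_j) \in D} x_{ij} = \sum_{\forall (q_i, u_j) \in D} (y_{ij})_M > \sum_{\forall (q_i, u_j) \in D} (y^*_{ij})_M$
(since $\forall (q_i, u_j) \in D$, $x_{ij} = (y_{ij})_M$). Hence, $(y^*)_M$ is not
the optimal solution of the MIP problem. It is a contradiction.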

Therefore, Theorem 2 has been proven. □ 

D. PROOF OF LEMMA 1, 2 AND 3 

PROOF. It is similar and straightforward to prove Lemmas 1, 2
and 3 (probabilistic differential privacy) using Theorem 1; we thus
prove them together.

The sanitized search log $O$ is generated in terms of the optimal
solution of O-UMP, F-UMP or D-UMP. We sample the output
based on the linear relaxed optimal solution $x^* = \{\lfloor x^*_{ij} \rfloor\}$ (which gives
the total count) and the query-url-user histograms in any input $D$
(which give the individual outcome probabilities). Since $\forall \lfloor x^*_{ij} \rfloor \le x^*_{ij}$,
we can infer that $\{\forall (q_i, u_j) \in O, \lfloor x^*_{ij} \rfloor\}$ satisfies Conditions
2 and 3 of Theorem 1 (the differential privacy constraints $\forall A_k \subseteq D$,
$\sum_{\forall (q_i, u_j) \in A_k} x_{ij} \cdot \log t_{ijk} \le \min\{\epsilon, \log \frac{1}{1 - \delta}\}$ in O-UMP, F-UMP
or D-UMP are satisfied). Moreover, Condition 1 of Theorem 1 is
also guaranteed in the preprocessing step.

Thus, while sampling user-IDs for any input search log $D$ and
its arbitrary neighboring input $D'$ with the optimal counts (given
by the optimization problem), we can divide the multinomial sampling
output space $\Omega$ (derived from $D$ and $D'$) into $\Omega_1$ and $\Omega_2$ as
described in Section 4, where all the probabilities in Definition 2 are
bounded by $\epsilon$ and $\delta$ in such a space split (refer to Theorem 1). Therefore,
the O/F/D-UMP based sanitization (randomization) algorithm
satisfies $(\epsilon, \delta)$-probabilistic differential privacy (we can add Laplacian
noise to ensure differential privacy for the step of computing
the optimal counts if necessary).

This completes the proof. □ 
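The sampling step referred to in this proof can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in the paper: it assumes the relaxed optimal counts are given as a dictionary keyed by (query, url), that user-IDs are drawn with probabilities proportional to the input counts $c_{ijk}$, and it omits the optional Laplacian noise; the function name `sample_output` is hypothetical.

```python
import math
import random

def sample_output(x_star, hist, seed=None):
    """x_star[(query, url)] = relaxed optimal count x*_ij;
    hist[(query, url)][user] = c_ijk from the input log.
    Returns a list of (query, url, user) triplets, the same schema as the input."""
    rng = random.Random(seed)
    output = []
    for pair, x in x_star.items():
        n = math.floor(x)  # rounding down never increases any constraint's left-hand side
        if n == 0:
            continue
        users = list(hist[pair])
        weights = [hist[pair][u] for u in users]  # proportional to c_ijk / c_ij
        # n independent draws with replacement, i.e. one multinomial sample.
        for user in rng.choices(users, weights=weights, k=n):
            output.append((pair[0], pair[1], user))
    return output
```

Because every sampled record is again a (query, url, user) triplet, the output keeps the schema of the input search log.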



